How to exclude multiple columns in Spark dataframe in Python

Asked 2020-12-03 00:46 by 一整个雨季

I found PySpark has a method called drop but it seems it can only drop one column at a time. Any ideas about how to drop multiple columns at the same time?

3 Answers
  • 2020-12-03 01:30

    The right way to do this is:

    df.drop(*['col1', 'col2', 'col3'])

    The * needs to come outside the brackets so that the list is unpacked and each column name is passed to drop as a separate argument.
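
    For instance, a minimal sketch (the list name and column names are placeholders, not from the question):

    cols_to_drop = ['col1', 'col2', 'col3']
    df = df.drop(*cols_to_drop)  # unpacks to df.drop('col1', 'col2', 'col3')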

  • 2020-12-03 01:34

    Simply with select:

    df.select([c for c in df.columns if c not in {'GpuName','GPU1_TwoPartHwID'}])
    

    or if you really want to use drop then reduce should do the trick:

    from functools import reduce
    from pyspark.sql import DataFrame
    
    reduce(DataFrame.drop, ['GpuName','GPU1_TwoPartHwID'], df)
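
    For clarity, the reduce call above is a left fold over drop, so it is equivalent to chaining one drop per column:

    df.drop('GpuName').drop('GPU1_TwoPartHwID')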
    

    Note (difference in execution time):

    There should be no difference when it comes to data processing time. While these methods generate different logical plans, the physical plans are exactly the same.
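
    You can verify this with explain(True), which prints both the logical and the physical plan (a sketch, assuming a DataFrame df that contains these two columns, and reusing the imports above):

    df.select([c for c in df.columns if c not in {'GpuName', 'GPU1_TwoPartHwID'}]).explain(True)
    reduce(DataFrame.drop, ['GpuName', 'GPU1_TwoPartHwID'], df).explain(True)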

    There is a difference however when we analyze driver-side code:

    • the first method makes only a single JVM call, while the second one has to call the JVM once for each column that has to be excluded
    • the first method generates a logical plan that is equivalent to the physical plan; in the second case it is rewritten
    • finally, comprehensions are significantly faster in Python than methods like map or reduce
    • Spark 2.x+ supports multiple columns in drop. See SPARK-11884 (Drop multiple columns in the DataFrame API) and SPARK-12204 (Implement drop method for DataFrame in SparkR) for details.
  • 2020-12-03 01:44

    Since PySpark 2.1.0, the drop method supports multiple columns:

    PySpark 2.0.2:

    DataFrame.drop(col)
    

    PySpark 2.1.0:

    DataFrame.drop(*cols)
    

    Example:

    df.drop('col1', 'col2')
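
    A self-contained sketch (PySpark 2.1.0+; the data and column names here are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("drop-example").getOrCreate()
    df = spark.createDataFrame([(1, "a", True), (2, "b", False)],
                               ["id", "col1", "col2"])
    df.drop("col1", "col2").show()  # only the `id` column remains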
    