pyspark: drop columns that have same values in all rows

北恋 2021-01-19 03:45

Related question: How to drop columns which have same values in all rows via pandas or spark dataframe?

So I have a pyspark dataframe, and I want to drop the columns

2 Answers
  • 2021-01-19 04:30

    You can apply the countDistinct() aggregation function to every column to get the number of distinct values per column. A column with a count of 1 holds a single value in all rows. Note that countDistinct() ignores nulls.

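    from pyspark.sql.functions import col, countDistinct

    # count the distinct values of every column in a single aggregation
    col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()

    # collect the names of the columns that have exactly one distinct value
    # (the loop variable is named c so it does not shadow the imported col function)
    cols_to_drop = [c for c in df.columns if col_counts[c] == 1]

    # drop the selected columns and display the result
    df.drop(*cols_to_drop).show()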
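
Because countDistinct() skips nulls, a column such as [2, null, 2] gets a count of 1 and an all-null column gets a count of 0. The following is a minimal sketch of a stricter variant, assuming that "same value in all rows" should mean either one non-null value with no nulls, or nulls only. The names constant_columns and drop_constant_columns and the `__distinct`, `__nulls`, `__rows` alias suffixes are hypothetical and not part of the answer above.

```python
def constant_columns(distinct_counts, null_counts, n_rows):
    """Return the names of columns that hold the same value in every row.

    distinct_counts: column name -> number of distinct non-null values
    null_counts:     column name -> number of null rows
    n_rows:          total number of rows
    """
    if n_rows == 0:
        # An empty DataFrame gives no evidence either way, so keep everything.
        return []
    constant = []
    for name, n_distinct in distinct_counts.items():
        n_null = null_counts[name]
        all_null = n_null == n_rows
        one_value_no_nulls = n_distinct == 1 and n_null == 0
        if all_null or one_value_no_nulls:
            constant.append(name)
    return constant


def drop_constant_columns(df):
    """Drop constant columns from a PySpark DataFrame using one aggregation."""
    # Imported here so that constant_columns() stays usable without Spark.
    from pyspark.sql import functions as F

    aggs = []
    for c in df.columns:
        aggs.append(F.countDistinct(F.col(c)).alias(f"{c}__distinct"))
        # when() without otherwise() yields null for non-matching rows,
        # and count() skips nulls, so this counts only the null rows of c.
        aggs.append(F.count(F.when(F.col(c).isNull(), 1)).alias(f"{c}__nulls"))
    aggs.append(F.count(F.lit(1)).alias("__rows"))

    row = df.agg(*aggs).collect()[0].asDict()
    distinct_counts = {c: row[f"{c}__distinct"] for c in df.columns}
    null_counts = {c: row[f"{c}__nulls"] for c in df.columns}

    to_drop = constant_columns(distinct_counts, null_counts, row["__rows"])
    return df.drop(*to_drop) if to_drop else df
```

Usage would be `drop_constant_columns(df).show()`. The selection step is kept as a plain function so the rule can be checked or changed without a Spark session.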
    
  • 2021-01-19 04:30

    You can use the approx_count_distinct function to count the number of distinct elements in each column. If a column has just one distinct value, remove it. The count is an estimate, which is usually cheaper than an exact distinct count on large data.

    Creating the DataFrame

    from pyspark.sql.functions import approx_count_distinct
    myValues = [(1,2,2,0),(2,2,2,0),(3,2,2,0),(4,2,2,0),(3,1,2,0)]
    df = sqlContext.createDataFrame(myValues,['value1','value2','value3','value4'])
    df.show()
    +------+------+------+------+
    |value1|value2|value3|value4|
    +------+------+------+------+
    |     1|     2|     2|     0|
    |     2|     2|     2|     0|
    |     3|     2|     2|     0|
    |     4|     2|     2|     0|
    |     3|     1|     2|     0|
    +------+------+------+------+
    

    Counting the number of distinct elements per column and converting the result into a dictionary.

    count_distinct_df = df.select([approx_count_distinct(x).alias(x) for x in df.columns])
    count_distinct_df.show()
    +------+------+------+------+
    |value1|value2|value3|value4|
    +------+------+------+------+
    |     4|     2|     1|     1|
    +------+------+------+------+
    dict_of_columns = count_distinct_df.toPandas().to_dict(orient='list')
    dict_of_columns
        {'value1': [4], 'value2': [2], 'value3': [1], 'value4': [1]}
    
    # Keep the column names (keys) whose distinct count is exactly 1.
    distinct_columns=[k for k,v in dict_of_columns.items() if v == [1]]
    distinct_columns
        ['value3', 'value4']
    

    Drop the columns that have only one distinct value

    df=df.drop(*distinct_columns)
    df.show()
    +------+------+
    |value1|value2|
    +------+------+
    |     1|     2|
    |     2|     2|
    |     3|     2|
    |     4|     2|
    |     3|     1|
    +------+------+
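
Since approx_count_distinct returns an estimate, an exact check may be preferable when correctness matters more than speed. The following is a minimal sketch of a different technique, not taken from the answer above: a column with at least one non-null value is constant exactly when its minimum equals its maximum. It assumes every column has an orderable type, and like the distinct counts it ignores nulls. The names constant_from_min_max and drop_constant_columns_exact and the `min_`/`max_` alias prefixes are hypothetical.

```python
def constant_from_min_max(mins, maxes):
    """Return column names whose min equals max and is not None.

    mins / maxes: column name -> minimum / maximum non-null value,
                  or None when the column contains only nulls.
    """
    return [
        name
        for name in mins
        if mins[name] is not None and mins[name] == maxes[name]
    ]


def drop_constant_columns_exact(df):
    """Drop columns with a single non-null value, using min and max only."""
    # Imported here so that constant_from_min_max() stays usable without Spark.
    from pyspark.sql import functions as F

    aggs = [F.min(c).alias(f"min_{c}") for c in df.columns]
    aggs += [F.max(c).alias(f"max_{c}") for c in df.columns]

    row = df.agg(*aggs).collect()[0].asDict()
    mins = {c: row[f"min_{c}"] for c in df.columns}
    maxes = {c: row[f"max_{c}"] for c in df.columns}

    to_drop = constant_from_min_max(mins, maxes)
    return df.drop(*to_drop) if to_drop else df
```

For the example DataFrame above, value3 (min 2, max 2) and value4 (min 0, max 0) would be selected, matching the result of the approximate count. All-null columns are left in place by this sketch.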
    