pyspark: drop columns that have same values in all rows

前端未结

关注

 2  1666

北恋 2021-01-19 03:45

Related question: How to drop columns which have same values in all rows via pandas or spark dataframe?

So I have a pyspark dataframe, and I want to drop the columns

2条回答

野的像风 (楼主)

2021-01-19 04:30

You can use approx_count_distinct function (link) to count the number of distinct elements in a column. In case there is just one distinct, the remove the corresponding column.

Creating the DataFrame

from pyspark.sql.functions import approx_count_distinct
myValues = [(1,2,2,0),(2,2,2,0),(3,2,2,0),(4,2,2,0),(3,1,2,0)]
df = sqlContext.createDataFrame(myValues,['value1','value2','value3','value4'])
df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
|     1|     2|     2|     0|
|     2|     2|     2|     0|
|     3|     2|     2|     0|
|     4|     2|     2|     0|
|     3|     1|     2|     0|
+------+------+------+------+

Couting number of distinct elements and converting it into dictionary.

count_distinct_df=df.select([approx_count_distinct(x).alias("{0}".format(x)) for x in df.columns])
count_distinct_df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
|     4|     2|     1|     1|
+------+------+------+------+
dict_of_columns = count_distinct_df.toPandas().to_dict(orient='list')
dict_of_columns
    {'value1': [4], 'value2': [2], 'value3': [1], 'value4': [1]}

#Storing those keys in the list which have just 1 distinct key.
distinct_columns=[k for k,v in dict_of_columns.items() if v == [1]]
distinct_columns
    ['value3', 'value4']

Drop the columns having distinct values

df=df.drop(*distinct_columns)
df.show()
+------+------+
|value1|value2|
+------+------+
|     1|     2|
|     2|     2|
|     3|     2|
|     4|     2|
|     3|     1|
+------+------+

0 讨论(0)

查看其它2个回答