pyspark: drop columns that have same values in all rows

北恋 2021-01-19 03:45

Related question: How to drop columns which have same values in all rows via pandas or spark dataframe?

So I have a pyspark dataframe, and I want to drop the columns

2 answers
  •  臣服心动
    2021-01-19 04:30

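    You can apply the countDistinct() aggregation function to every column to get the number of distinct values per column. A column whose count is 1 holds the same value in every row. Note that countDistinct() ignores nulls, so a column containing only nulls gets a count of 0 rather than 1.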

    from pyspark.sql.functions import col, countDistinct
    
    # count the distinct (non-null) values of every column in a single aggregation
    col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()
    
    # collect the names of the columns that hold exactly one distinct value
    cols_to_drop = [c for c in df.columns if col_counts[c] == 1]
    
    # drop the selected columns and display the result
    df.drop(*cols_to_drop).show()
    
