Drop a Spark DataFrame column if all of its entries are null

轮回少年 2021-01-13 19:11

Using PySpark, how can I select/keep all columns of a DataFrame that contain at least one non-null value, or equivalently, remove all columns that contain no data?

8 Answers
  •  臣服心动
    2021-01-13 19:37

    Here is my approach. Say I have a DataFrame like the one below:

    >>> df.show()
    +----+----+----+
    |col1|col2|col3|
    +----+----+----+
    |   1|   2|null|
    |null|   3|null|
    |   5|null|null|
    +----+----+----+
    
    >>> from pyspark.sql import functions as F
    >>> # F.count() ignores nulls, so this gives the non-null count per column
    >>> df1 = df.agg(*[F.count(c).alias(c) for c in df.columns])
    >>> df1.show()
    +----+----+----+
    |col1|col2|col3|
    +----+----+----+
    |   2|   2|   0|
    +----+----+----+
    
    >>> nonNull_cols = [c for c in df1.columns if df1[[c]].first()[c] > 0]
    >>> df = df.select(*nonNull_cols)
    >>> df.show()
    +----+----+
    |col1|col2|
    +----+----+
    |   1|   2|
    |null|   3|
    |   5|null|
    +----+----+
    
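    The same idea can be packaged as a reusable helper. The following is a minimal sketch, assuming an active SparkSession; the helper name drop_all_null_columns is illustrative, not from the answer above:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    def drop_all_null_columns(df):
        # F.count(c) counts only non-null values, so a column whose
        # count is 0 contains no data at all.
        counts = df.agg(*[F.count(c).alias(c) for c in df.columns]).first()
        non_null_cols = [c for c in df.columns if counts[c] > 0]
        return df.select(*non_null_cols)

    # Same data as in the answer above (explicit schema, since Spark
    # cannot infer a type for an all-null column):
    df = spark.createDataFrame(
        [(1, 2, None), (None, 3, None), (5, None, None)],
        schema="col1 int, col2 int, col3 int",
    )
    drop_all_null_columns(df).show()
    # +----+----+
    # |col1|col2|
    # +----+----+
    # |   1|   2|
    # |null|   3|
    # |   5|null|
    # +----+----+

    Note that this still requires one full scan of the DataFrame to compute the counts, which is the unavoidable cost of checking every value for nulls.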
