Using PySpark, how can I select/keep all columns of a DataFrame that contain at least one non-null value; or, equivalently, how can I drop every column that contains no data at all?
Here is my attempt. Say I have a DataFrame like this:
>>> df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|null|
|null|   3|null|
|   5|null|null|
+----+----+----+
F.count counts only non-null values, so aggregating with it gives the number of non-null entries per column:

>>> from pyspark.sql import functions as F
>>> df1 = df.agg(*[F.count(c).alias(c) for c in df.columns])
>>> df1.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   2|   2|   0|
+----+----+----+
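The whole counts row can also be pulled back to the driver in a single action, e.g. as a plain dict (just a sanity check; the exact output assumes the sample data above):

>>> df1.first().asDict()
{'col1': 2, 'col2': 2, 'col3': 0}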
Then I keep only the columns whose non-null count is positive:

>>> nonNull_cols = [c for c in df1.columns if df1[[c]].first()[c] > 0]
>>> df = df.select(*nonNull_cols)
>>> df.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|null|   3|
|   5|null|
+----+----+
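This gives the result I want, but the list comprehension calls first() once per column, so Spark launches a separate job for every column. A variant I sketched that collects the counts row only once (self-contained; the SparkSession setup and sample data are only there to make it runnable):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data matching the frame above; col3 is entirely null.
df = spark.createDataFrame(
    [(1, 2, None), (None, 3, None), (5, None, None)],
    "col1 int, col2 int, col3 int",
)

# One aggregation job: non-null count for every column at once.
counts = df.agg(*[F.count(c).alias(c) for c in df.columns]).first()

# Keep only the columns with at least one non-null value.
df = df.select(*[c for c in df.columns if counts[c] > 0])
df.show()  # prints the same two-column result as above

Is there a cleaner or more built-in way to do this?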