Using PySpark, how can I select/keep all columns of a DataFrame that contain at least one non-null value; or, equivalently, how can I drop every column that contains no data at all?
Here is my attempt. Say I have a DataFrame like this:
>>> df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|null|
|null|   3|null|
|   5|null|null|
+----+----+----+
F.count counts only non-null values, so aggregating with it gives the number of non-null entries per column:

>>> from pyspark.sql import functions as F
>>> df1 = df.agg(*[F.count(c).alias(c) for c in df.columns])
>>> df1.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   2|   2|   0|
+----+----+----+
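The whole counts row can also be pulled back to the driver in a single action, e.g. as a plain dict (just a sanity check; the exact output assumes the sample data above):

>>> df1.first().asDict()
{'col1': 2, 'col2': 2, 'col3': 0}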
Then I keep only the columns whose non-null count is positive:

>>> nonNull_cols = [c for c in df1.columns if df1[[c]].first()[c] > 0]
>>> df = df.select(*nonNull_cols)
>>> df.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|null|   3|
|   5|null|
+----+----+
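This gives the result I want, but the list comprehension calls first() once per column, so Spark launches a separate job for every column. A variant I sketched that collects the counts row only once (self-contained; the SparkSession setup and sample data are only there to make it runnable):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data matching the frame above; col3 is entirely null.
df = spark.createDataFrame(
    [(1, 2, None), (None, 3, None), (5, None, None)],
    "col1 int, col2 int, col3 int",
)

# One aggregation job: non-null count for every column at once.
counts = df.agg(*[F.count(c).alias(c) for c in df.columns]).first()

# Keep only the columns with at least one non-null value.
df = df.select(*[c for c in df.columns if counts[c] > 0])
df.show()  # prints the same two-column result as above

Is there a cleaner or more built-in way to do this?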