Using PySpark, how can I select/keep all columns of a DataFrame which contain a non-null value, or equivalently, remove all columns which contain no data?
Picking up pieces from the answers above, I wrote my own solution for my use case.
What I was essentially trying to do is remove all columns from my PySpark DataFrame which had 100% null values.
# identify and remove all columns having 100% null values
df_summary_count = your_df.summary("count")
null_cols = [c for c in df_summary_count.columns if df_summary_count.select(c).first()[c] == '0']
filtered_df = your_df.drop(*null_cols)
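For what it's worth, here is a minimal sketch (hypothetical data, not from the original post) showing why the comparison is against the string '0': summary("count") reports every column's non-null count as a string, and its extra summary column never matches '0', so it is never flagged:

# hypothetical example: column "d" is entirely null and gets dropped
your_df = spark.createDataFrame([(1, None), (2, None)], "a: int, d: int")
df_summary_count = your_df.summary("count")
df_summary_count.show()  # one row: summary='count', a='2', d='0' (all values are strings)
null_cols = [c for c in df_summary_count.columns if df_summary_count.select(c).first()[c] == '0']
filtered_df = your_df.drop(*null_cols)
filtered_df.show()       # only column "a" remains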
This solution takes into account all possible combinations of nulls that could be in a column: first all null columns are found, then they are dropped. It looks lengthy and cumbersome, but it is robust. Only one loop is used to find the null columns, and no memory-intensive function such as collect() is applied, which should keep it reasonably fast and efficient.
import pyspark.sql.functions as F

rows = [(None, 18, None, None),
        (1, None, None, None),
        (1, 9, 4.0, None),
        (None, 0, 0.0, None)]
schema = "a: int, b: int, c: float, d: int"
df = spark.createDataFrame(data=rows, schema=schema)
def get_null_column_names(df):
    column_names = []
    for col_name in df.columns:
        min_ = df.select(F.min(col_name)).first()[0]
        max_ = df.select(F.max(col_name)).first()[0]
        if min_ is None and max_ is None:
            column_names.append(col_name)
    return column_names

null_columns = get_null_column_names(df)

def drop_column(null_columns, df):
    for column_ in null_columns:
        df = df.drop(column_)
    return df

df = drop_column(null_columns, df)
df.show()
Output:
+----+----+----+
|   a|   b|   c|
+----+----+----+
|null|  18|null|
|   1|null|null|
|   1|   9| 4.0|
|null|   0| 0.0|
+----+----+----+
Or just
from pyspark.sql.functions import col
for c in df.columns:
    if df.filter(col(c).isNotNull()).count() == 0:
        df = df.drop(c)
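If the DataFrame has many columns, that loop triggers one count() job per column. A single-pass variant of the same idea (a sketch, not from the original answer) counts the non-null values of every column in one aggregation and then drops the all-null ones in one call:

from pyspark.sql import functions as F

# one aggregation job: non-null count per column, returned as a single Row
counts = df.agg(*[F.count(F.col(c)).alias(c) for c in df.columns]).first()
df = df.drop(*[c for c in df.columns if counts[c] == 0])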
Here is my approach. Say I have a DataFrame as below:
>>> df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 2|null|
|null| 3|null|
| 5|null|null|
+----+----+----+
>>> import pyspark.sql.functions as F
>>> df1 = df.agg(*[F.count(c).alias(c) for c in df.columns])
>>> df1.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 2| 2| 0|
+----+----+----+
>>> nonNull_cols = [c for c in df1.columns if df1[[c]].first()[c] > 0]
>>> df = df.select(*nonNull_cols)
>>> df.show()
+----+----+
|col1|col2|
+----+----+
| 1| 2|
|null| 3|
| 5|null|
+----+----+
One indirect way to do this is:
import pyspark.sql.functions as func
for col in sdf.columns:
    if sdf.filter(func.isnan(func.col(col))).count() == sdf.select(func.col(col)).count():
        sdf = sdf.drop(col)
Update:
The above code drops columns in which every value is NaN. If you are looking for all nulls, then:
import pyspark.sql.functions as func
for col in sdf.columns:
    if sdf.filter(func.col(col).isNull()).count() == sdf.select(func.col(col)).count():
        sdf = sdf.drop(col)
I'll update my answer if I find a more optimal way :-)
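If both nulls and NaNs should count as missing, a combined check could look like this (a sketch, not from the original answer; note that isnan() assumes numeric columns):

import pyspark.sql.functions as func

# drop columns where every value is either null or NaN (numeric columns assumed)
for col in sdf.columns:
    if sdf.filter(func.col(col).isNotNull() & ~func.isnan(func.col(col))).count() == 0:
        sdf = sdf.drop(col)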
This is a function I have in my pipeline to remove null columns. Hope it helps!
# Function to drop the empty columns of a DF
def dropNullColumns(df):
    # String representations of the "empty" values you can encounter
    null_set = {"none", "null", "nan"}
    # Iterate over each column in the DF
    for col in df.columns:
        # Get the distinct values of the column
        unique_vals = df.select(col).distinct().collect()
        # Drop only if the column has a single distinct value and it is none/nan/null
        if len(unique_vals) == 1 and str(unique_vals[0][0]).lower() in null_set:
            print("Dropping " + col + " because of all null values.")
            df = df.drop(col)
    return df
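A quick usage sketch on hypothetical data similar to the example earlier in the thread (an active spark session is assumed):

rows = [(None, 18, None, None), (1, None, None, None), (1, 9, 4.0, None), (None, 0, 0.0, None)]
df = spark.createDataFrame(rows, "a: int, b: int, c: float, d: int")
dropNullColumns(df).show()  # prints "Dropping d ..." and shows only columns a, b, c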