Using PySpark, how can I select/keep all columns of a DataFrame that contain at least one non-null value, or equivalently drop all columns that contain no data?
This is a function I have in my pipeline to remove null columns. Hope it helps!
# Function to drop the empty columns of a DataFrame
def dropNullColumns(df):
    # String representations of the "null" values you can encounter
    null_set = {"none", "null", "nan"}
    # Iterate over each column in the DataFrame
    for col in df.columns:
        # Collect at most two distinct values so we can tell whether the
        # column holds a single value without pulling the whole column back
        unique_rows = df.select(col).distinct().limit(2).collect()
        # Drop the column only if it has exactly one distinct value
        # and that value is none/null/nan
        if len(unique_rows) == 1 and str(unique_rows[0][0]).lower() in null_set:
            print("Dropping " + col + " because it contains only null values.")
            df = df.drop(col)
    return df
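If the DataFrame has many columns, scanning it once per column can get slow. As an alternative sketch (not part of the original answer), you can count the non-null values of every column in a single pass and drop the columns whose count is zero. The helper name drop_empty_columns is just illustrative, and this version only treats true nulls as empty; NaNs in float columns would need an extra F.isnan check.

from pyspark.sql import functions as F

def drop_empty_columns(df):
    # Count the non-null entries of every column in a single job
    counts = df.select([
        F.count(F.when(F.col(c).isNotNull(), c)).alias(c) for c in df.columns
    ]).collect()[0].asDict()
    # Columns with zero non-null values are empty and can be dropped
    to_drop = [c for c, n in counts.items() if n == 0]
    return df.drop(*to_drop)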