Using PySpark, how can I select/keep all columns of a DataFrame that contain at least one non-null value, or equivalently drop all columns that contain no data?
This is a function I have in my pipeline to remove null columns. Hope it helps!
# Function to drop the empty columns of a DataFrame
def dropNullColumns(df):
    # String representations of the "null" values you can encounter
    null_set = {"none", "null", "nan"}
    # Iterate over each column in the DataFrame
    for col in df.columns:
        # Collect at most two distinct values so we can tell whether the
        # column holds a single value without pulling the whole column back
        unique_rows = df.select(col).distinct().limit(2).collect()
        # Drop the column only if it has exactly one distinct value
        # and that value is none/null/nan
        if len(unique_rows) == 1 and str(unique_rows[0][0]).lower() in null_set:
            print("Dropping " + col + " because it contains only null values.")
            df = df.drop(col)
    return df
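If the DataFrame has many columns, scanning it once per column can get slow. As an alternative sketch (not part of the original answer), you can count the non-null values of every column in a single pass and drop the columns whose count is zero. The helper name drop_empty_columns is just illustrative, and this version only treats true nulls as empty; NaNs in float columns would need an extra F.isnan check.

from pyspark.sql import functions as F

def drop_empty_columns(df):
    # Count the non-null entries of every column in a single job
    counts = df.select([
        F.count(F.when(F.col(c).isNotNull(), c)).alias(c) for c in df.columns
    ]).collect()[0].asDict()
    # Columns with zero non-null values are empty and can be dropped
    to_drop = [c for c, n in counts.items() if n == 0]
    return df.drop(*to_drop)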