Drop a Spark DataFrame column if all of its entries are null

Backend · Unresolved · 8 answers · 1175 views
轮回少年 2021-01-13 19:11

Using PySpark, how can I select/keep all columns of a DataFrame that contain at least one non-null value; or, equivalently, remove all columns that contain no data?

8 Answers
  •  悲&欢浪女
    2021-01-13 19:45

    This is a function I have in my pipeline to remove all-null columns. Hope it helps!

    # Function to drop the all-null columns of a DF
    def dropNullColumns(df):
        # String forms of the null values you can encounter
        null_set = {"none", "null", "nan"}
        # Iterate over each column in the DF
        for col in df.columns:
            # Collect the distinct values of the column
            unique_vals = df.select(col).distinct().collect()
            # Drop the column only if it has exactly one distinct
            # value and that value is None/null/NaN; checking the
            # count first avoids dropping a column that also holds
            # real data
            if len(unique_vals) == 1 and str(unique_vals[0][0]).lower() in null_set:
                print("Dropping " + col + " because of all null values.")
                df = df.drop(col)
        return df
    
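    A faster alternative is to count the non-null values of every column in a single aggregation pass: F.count skips nulls, so a count of 0 means the column holds no data. Here is a minimal sketch (the function name dropAllNullColumns is just for illustration; note that NaN entries in numeric columns are real values to F.count, so treating them as null would need an extra F.isnan check):

    from pyspark.sql import functions as F

    # One aggregation pass: count the non-null values of every column.
    # F.count(column) ignores nulls, so a zero count means the column
    # contains no data at all.
    def dropAllNullColumns(df):  # illustrative name, not from the question
        counts = df.agg(
            *[F.count(F.col(c)).alias(c) for c in df.columns]
        ).first().asDict()
        empty_cols = [c for c, n in counts.items() if n == 0]
        return df.drop(*empty_cols)

    Because all the counts come back in one row, the data is scanned once no matter how many columns the DataFrame has, instead of running one distinct() job per column.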
