Drop a Spark DataFrame column if all of its entries are null

轮回少年 2021-01-13 19:11

Using PySpark, how can I keep only the columns of a DataFrame that contain at least one non-null value or, equivalently, remove all columns that contain no data?

8 Answers
  • 2021-01-13 19:49

    This worked for me, in a slightly different way from @Suresh's answer:

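    from pyspark.sql import functions as func

    # keep a column only if at least one of its rows is non-null (runs one count per column)
    nonNull_cols = [c for c in original_df.columns if original_df.filter(func.col(c).isNotNull()).count() > 0]
    new_df = original_df.select(*nonNull_cols)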
    
  • 2021-01-13 19:53

    Here's a much more efficient solution that doesn't run a separate Spark job for each column; it checks every column in a single aggregation instead. It is much faster when you have many columns. I tested the other methods here on a DataFrame with 800 columns, and they took 17 minutes to run. The following method takes only 1 minute in my tests on the same dataset.

    from pyspark.sql import functions as F

    def drop_fully_null_columns(df, but_keep_these=None):
        """Drops DataFrame columns that are fully null
        (i.e. the maximum value is null)

        Arguments:
            df {spark DataFrame} -- spark dataframe
            but_keep_these {list} -- list of columns to keep without checking for nulls

        Returns:
            spark DataFrame -- dataframe with fully null columns removed
        """
        # avoid a mutable default argument
        but_keep_these = but_keep_these or []

        # skip checking some columns
        cols_to_check = [col for col in df.columns if col not in but_keep_these]
        if not cols_to_check:
            return df

        # one aggregation computes the max of every checked column;
        # max ignores nulls, so it is None only when the column is fully null
        row_with_maxes = df.select(*cols_to_check).groupby().agg(*[F.max(c).alias(c) for c in cols_to_check]).take(1)[0]
        cols_to_drop = [c for c, const in row_with_maxes.asDict().items() if const is None]
        return df.drop(*cols_to_drop)
    