Drop a column if all entries in a specific column of a Spark DataFrame are null

轮回少年 2021-01-13 19:11

Using PySpark, how can I select/keep all columns of a DataFrame that contain at least one non-null value or, equivalently, remove all columns that contain no data?

8 Answers
  • 2021-01-13 19:32

    Picking up pieces from the other answers, I wrote my own solution for my use case.

    Essentially, I wanted to remove every column of my PySpark DataFrame whose values are 100% null.

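    # Identify and remove all columns having 100% null values.
    # Note: summary() only reports on numeric and string columns.
    df_summary_count = your_df.summary("count")
    counts = df_summary_count.first().asDict()  # fetch the single summary row once
    null_cols = [c for c, n in counts.items() if n == '0']
    filtered_df = your_df.drop(*null_cols)  # drop from the original DataFrame, not the summary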
    
  • 2021-01-13 19:33

    This solution handles any mix of nulls and values in a column. It first finds the columns that are entirely null and then drops them. It looks lengthy, but it needs only one loop to find the null columns and avoids memory-intensive calls such as collect(): each query brings a single aggregated row back to the driver through first(). It does, however, still run two aggregations per column (one for min, one for max).

    from pyspark.sql import functions as F
    
    rows = [(None, 18, None, None),
            (1, None, None, None),
            (1, 9, 4.0, None),
            (None, 0, 0., None)]
    
    schema = "a: int, b: int, c: float, d:int"
    df = spark.createDataFrame(data=rows, schema=schema)
    
    def get_null_column_names(df):
        column_names = []
    
        for col_name in df.columns:
    
            min_ = df.select(F.min(col_name)).first()[0]
            max_ = df.select(F.max(col_name)).first()[0]
    
            if min_ is None and max_ is None:
                column_names.append(col_name)
    
        return column_names
    
    null_columns = get_null_column_names(df)
    
    def drop_column(null_columns, df):
        # Drop every all-null column, then return once the loop has finished
        for column_ in null_columns:
            df = df.drop(column_)
        return df
    
    df = drop_column(null_columns, df)
    df.show()
    

    Output: only columns a, b and c remain; column d, which is entirely null, is dropped.

  • 2021-01-13 19:35

    Or just

    from pyspark.sql.functions import col
    
    for c in df.columns:
        if df.filter(col(c).isNotNull()).count() == 0:
            df = df.drop(c)
    
  • 2021-01-13 19:37

    Here is my approach. Suppose I have the following DataFrame:

    >>> df.show()
    +----+----+----+
    |col1|col2|col3|
    +----+----+----+
    |   1|   2|null|
    |null|   3|null|
    |   5|null|null|
    +----+----+----+
    
    >>> from pyspark.sql import functions as F
    >>> df1 = df.agg(*[F.count(c).alias(c) for c in df.columns])
    >>> df1.show()
    +----+----+----+
    |col1|col2|col3|
    +----+----+----+
    |   2|   2|   0|
    +----+----+----+
    
    >>> counts = df1.first()
    >>> nonNull_cols = [c for c in df1.columns if counts[c] > 0]
    >>> df = df.select(*nonNull_cols)
    >>> df.show()
    +----+----+
    |col1|col2|
    +----+----+
    |   1|   2|
    |null|   3|
    |   5|null|
    +----+----+
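
The aggregation shown in this answer could be wrapped into a reusable function. The following is only a minimal sketch: the names `all_null_columns` and `drop_all_null_columns` are hypothetical, and it assumes the DataFrame has at least one column and no column names containing dots.

```python
def all_null_columns(non_null_counts):
    """Return the column names whose non-null count is zero.

    `non_null_counts` maps column name -> number of non-null values.
    """
    return [name for name, n in non_null_counts.items() if n == 0]


def drop_all_null_columns(df):
    """Drop every column of a PySpark DataFrame that holds only nulls."""
    # Imported here so that all_null_columns() stays usable without Spark.
    from pyspark.sql import functions as F

    # F.count(column) counts non-null values only, so a single aggregation
    # yields the non-null count of every column at once.
    counts_row = df.agg(*[F.count(F.col(c)).alias(c) for c in df.columns]).first()
    to_drop = all_null_columns(counts_row.asDict())
    # A DataFrame with zero rows has a count of 0 everywhere, so all of its
    # columns would be dropped.
    return df.drop(*to_drop) if to_drop else df
```

With the example data above, `drop_all_null_columns(df)` would keep col1 and col2 while running one Spark job instead of one per column.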
    
  • 2021-01-13 19:43

    One indirect way to do this is:

    import pyspark.sql.functions as func
    
    total_rows = sdf.count()  # the row count is the same for every column
    for col in sdf.columns:
        # Drop the column when every row holds NaN
        if sdf.filter(func.isnan(func.col(col))).count() == total_rows:
            sdf = sdf.drop(col)
    

    Update:
    The code above drops columns that are entirely NaN. To drop columns that are entirely null instead, use:

    import pyspark.sql.functions as func
    
    total_rows = sdf.count()  # the row count is the same for every column
    for col in sdf.columns:
        # Drop the column when every row is null
        if sdf.filter(func.col(col).isNull()).count() == total_rows:
            sdf = sdf.drop(col)
    

    I will update my answer if I find a more optimal way :-)
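
One possible single-pass variant treats both null and NaN as missing and runs one aggregation for the whole DataFrame instead of a filter per column. This is a sketch under assumptions, not a tested drop-in: the helper names `present_count_expr` and `drop_null_or_nan_columns` are hypothetical, and it assumes that only columns reported as float or double by `sdf.dtypes` can hold NaN.

```python
def present_count_expr(name, dtype):
    """Build a Spark SQL expression that counts values which are neither null nor NaN."""
    # Quote the column name, doubling any backtick it contains.
    quoted = "`" + name.replace("`", "``") + "`"
    if dtype in ("float", "double"):
        # isnan(NULL) is false, so nulls reach THEN and are then skipped by count().
        return f"count(CASE WHEN NOT isnan({quoted}) THEN {quoted} END) AS {quoted}"
    # Other types cannot hold NaN, and count() already ignores nulls.
    return f"count({quoted}) AS {quoted}"


def drop_null_or_nan_columns(sdf):
    """Drop columns whose values are all null or NaN, using a single aggregation."""
    exprs = [present_count_expr(name, dtype) for name, dtype in sdf.dtypes]
    counts = sdf.selectExpr(*exprs).first().asDict()
    empty = [name for name, n in counts.items() if n == 0]
    return sdf.drop(*empty) if empty else sdf
```

Building the expressions as SQL strings keeps the type check in plain Python, so the expression for each column can be inspected before any Spark job runs.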

  • 2021-01-13 19:45

    This is a function I have in my pipeline to remove null columns. Hope it helps!

    # Function to drop the empty columns of a DF
    def dropNullColumns(df):
        # String forms of the null-like values you can encounter
        null_set = {"none", "null", "nan"}
        # Iterate over each column in the DF
        for col in df.columns:
            # Fetch at most two distinct values: enough to tell whether there is exactly one
            unique_vals = df.select(col).distinct().limit(2).collect()
            # Drop only when the single distinct value is none/nan or null
            if len(unique_vals) == 1 and str(unique_vals[0][0]).lower() in null_set:
                print("Dropping " + col + " because of all null values.")
                df = df.drop(col)
        return df
    