How to drop columns based on multiple filters in a dataframe using PySpark?

渐次进展 2021-01-15 14:51

I have a list of valid values that a cell can have. If any cell in a column is invalid, I need to drop the whole column. I understand there are answers about dropping rows in a

1 Answer
  • 2021-01-15 15:48

    You wrote: "I am not only looking at the code solution but more on the off-the-shelf code provided from PySpark."

    Unfortunately, Spark is designed to operate in parallel on a row-by-row basis. Dropping columns based on the values they contain is not something for which there is an "off-the-shelf" solution.

    Nevertheless, here is one approach you can take:

    First collect the counts of the invalid elements in each column.

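    from pyspark.sql.functions import col, lit, sum as _sum, when

    # Values a cell is allowed to contain
    valid = ['Messi', 'Ronaldo', 'Virgil']
    # For each column, sum 0 for every valid cell and 1 for every invalid one
    # (a null cell fails the isin test, so it also counts as invalid)
    invalid_counts = df.select(
        *[_sum(when(col(c).isin(valid), lit(0)).otherwise(lit(1))).alias(c) for c in df.columns]
    ).collect()
    print(invalid_counts)
    #[Row(Column 1=0, Column 2=1, Column 3=0, Column 4=1, Column 5=3)]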
    

    The output is a list containing a single Row, with one invalid count per column. Iterate over the items of that Row to find the columns to keep, that is, the ones whose count is 0.

    valid_columns = [k for k,v in invalid_counts[0].asDict().items() if v == 0]
    print(valid_columns)
    #['Column 3', 'Column 1']
    

    Now just select these columns from your original DataFrame. You can first sort valid_columns using list.index if you want to maintain the original column order.

    valid_columns = sorted(valid_columns, key=df.columns.index)
    df.select(valid_columns).show()
    #+--------+--------+
    #|Column 1|Column 3|
    #+--------+--------+
    #| Ronaldo|   Messi|
    #| Ronaldo|  Virgil|
    #| Ronaldo|   Messi|
    #+--------+--------+
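
The three steps above can be folded into one helper. What follows is only a sketch: the names `columns_to_keep` and `drop_invalid_columns` are hypothetical, not part of PySpark. It walks `df.columns` directly, so no separate sort is needed, and it treats a count of `None` as zero. That last choice is an assumption: `sum` over an empty DataFrame yields null, and with the `v == 0` test every column of an empty DataFrame would be dropped.

```python
from typing import Dict, List, Optional


def columns_to_keep(invalid_counts: Dict[str, Optional[int]],
                    column_order: List[str]) -> List[str]:
    """Return the columns with no invalid cells, in their original order.

    A count of None (what sum() gives for an empty DataFrame) is treated
    as zero here; this is an assumption of the sketch, adjust as needed.
    """
    return [c for c in column_order if (invalid_counts[c] or 0) == 0]


def drop_invalid_columns(df, valid):
    """Hypothetical wrapper: keep only the columns of `df` whose cells are all in `valid`."""
    # Imported here so columns_to_keep stays usable without a Spark installation.
    from pyspark.sql.functions import col, lit, sum as _sum, when

    # A global aggregate always returns exactly one row, so first() is enough.
    counts = df.select(
        *[_sum(when(col(c).isin(valid), lit(0)).otherwise(lit(1))).alias(c)
          for c in df.columns]
    ).first().asDict()
    return df.select(*columns_to_keep(counts, df.columns))
```

With the DataFrame from the question, `drop_invalid_columns(df, valid).show()` should print the same two columns as above. Like the original approach, this still runs one aggregation job over the whole DataFrame before selecting.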
    