Using PySpark, how can I select/keep all columns of a DataFrame which contain a non-null value, or equivalently, remove all columns which contain no data?
Picking up pieces from the answers above, I wrote my own solution for my use case.
What I was essentially trying to do is remove all columns from my PySpark DataFrame which had 100% null values.
# identify and remove all columns having 100% null values
df_summary_count = your_df.summary("count")
null_cols = [c for c in df_summary_count.columns if df_summary_count.select(c).first()[c] == '0']
filtered_df = your_df.drop(*null_cols)
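For what it's worth, here is a minimal sketch (hypothetical data, not from the original post) showing why the comparison is against the string '0': summary("count") reports every column's non-null count as a string, and its extra summary column never matches '0', so it is never flagged:

# hypothetical example: column "d" is entirely null and gets dropped
your_df = spark.createDataFrame([(1, None), (2, None)], "a: int, d: int")
df_summary_count = your_df.summary("count")
df_summary_count.show()  # one row: summary='count', a='2', d='0' (all values are strings)
null_cols = [c for c in df_summary_count.columns if df_summary_count.select(c).first()[c] == '0']
filtered_df = your_df.drop(*null_cols)
filtered_df.show()       # only column "a" remains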
This solution takes into account all possible combinations of nulls that could be in a column: first all null columns are found, then they are dropped. It looks lengthy and cumbersome, but it is robust. Only one loop is used to find the null columns, and no memory-intensive function such as collect() is applied, which should keep it reasonably fast and efficient.
import pyspark.sql.functions as F

rows = [(None, 18, None, None),
        (1, None, None, None),
        (1, 9, 4.0, None),
        (None, 0, 0.0, None)]
schema = "a: int, b: int, c: float, d: int"
df = spark.createDataFrame(data=rows, schema=schema)
def get_null_column_names(df):
    column_names = []
    for col_name in df.columns:
        min_ = df.select(F.min(col_name)).first()[0]
        max_ = df.select(F.max(col_name)).first()[0]
        if min_ is None and max_ is None:
            column_names.append(col_name)
    return column_names

null_columns = get_null_column_names(df)

def drop_column(null_columns, df):
    for column_ in null_columns:
        df = df.drop(column_)
    return df

df = drop_column(null_columns, df)
df.show()
Output:
+----+----+----+
|   a|   b|   c|
+----+----+----+
|null|  18|null|
|   1|null|null|
|   1|   9| 4.0|
|null|   0| 0.0|
+----+----+----+
Or just
from pyspark.sql.functions import col
for c in df.columns:
    if df.filter(col(c).isNotNull()).count() == 0:
        df = df.drop(c)
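If the DataFrame has many columns, that loop triggers one count() job per column. A single-pass variant of the same idea (a sketch, not from the original answer) counts the non-null values of every column in one aggregation and then drops the all-null ones in one call:

from pyspark.sql import functions as F

# one aggregation job: non-null count per column, returned as a single Row
counts = df.agg(*[F.count(F.col(c)).alias(c) for c in df.columns]).first()
df = df.drop(*[c for c in df.columns if counts[c] == 0])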
Here is my approach. Say I have a DataFrame as below:
>>> df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 2|null|
|null| 3|null|
| 5|null|null|
+----+----+----+
>>> import pyspark.sql.functions as F
>>> df1 = df.agg(*[F.count(c).alias(c) for c in df.columns])
>>> df1.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 2| 2| 0|
+----+----+----+
>>> nonNull_cols = [c for c in df1.columns if df1[[c]].first()[c] > 0]
>>> df = df.select(*nonNull_cols)
>>> df.show()
+----+----+
|col1|col2|
+----+----+
| 1| 2|
|null| 3|
| 5|null|
+----+----+
One indirect way to do this is:
import pyspark.sql.functions as func
for col in sdf.columns:
    if sdf.filter(func.isnan(func.col(col))).count() == sdf.select(func.col(col)).count():
        sdf = sdf.drop(col)
Update:
The above code drops columns in which every value is NaN. If you are looking for all nulls, then:
import pyspark.sql.functions as func
for col in sdf.columns:
    if sdf.filter(func.col(col).isNull()).count() == sdf.select(func.col(col)).count():
        sdf = sdf.drop(col)
I'll update my answer if I find a more optimal way :-)
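If both nulls and NaNs should count as missing, a combined check could look like this (a sketch, not from the original answer; note that isnan() assumes numeric columns):

import pyspark.sql.functions as func

# drop columns where every value is either null or NaN (numeric columns assumed)
for col in sdf.columns:
    if sdf.filter(func.col(col).isNotNull() & ~func.isnan(func.col(col))).count() == 0:
        sdf = sdf.drop(col)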
This is a function I have in my pipeline to remove null columns. Hope it helps!
# Function to drop the empty columns of a DF
def dropNullColumns(df):
    # String representations of the "empty" values you can encounter
    null_set = {"none", "null", "nan"}
    # Iterate over each column in the DF
    for col in df.columns:
        # Get the distinct values of the column
        unique_vals = df.select(col).distinct().collect()
        # Drop only if the column has a single distinct value and it is none/nan/null
        if len(unique_vals) == 1 and str(unique_vals[0][0]).lower() in null_set:
            print("Dropping " + col + " because of all null values.")
            df = df.drop(col)
    return df
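A quick usage sketch on hypothetical data similar to the example earlier in the thread (an active spark session is assumed):

rows = [(None, 18, None, None), (1, None, None, None), (1, 9, 4.0, None), (None, 0, 0.0, None)]
df = spark.createDataFrame(rows, "a: int, b: int, c: float, d: int")
dropNullColumns(df).show()  # prints "Dropping d ..." and shows only columns a, b, c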