Spark Scala: remove columns containing only null values


Is there a way to remove the columns of a Spark DataFrame that contain only null values? (I am using Scala and Spark 1.6.2.)

At the moment I am doing this:


3 Answers

暗喜

2021-01-12 14:43

Here's a Scala example that removes all-null columns while querying the data only once (faster):

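```
import org.apache.spark.sql.DataFrame

def removeNullColumns(df: DataFrame): DataFrame = {
  // One aggregation job: count(col) counts only the non-null values of each column.
  val exprs = df.columns.map(_ -> "count").toMap
  val cnts = df.agg(exprs).first()
  // Drop every column whose non-null count is zero.
  df.columns.foldLeft(df) { (dfNoNulls, c) =>
    if (cnts.getAs[Long]("count(" + c + ")") == 0L) dfNoNulls.drop(c) else dfNoNulls
  }
}
```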
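
For anyone who wants to try this on a toy DataFrame, here is a small self-contained sketch. It is a variant of the function above, not the same code: each count is aliased back to its column name, so nothing depends on the generated `count(...)` column names. The name `dropAllNullColumns`, the sample schema and the local `SparkSession` (Spark 2.x or later; on 1.6.2 a `SQLContext` would take its place) are assumptions for illustration. It also assumes at least one column and plain column names without dots.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, count}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical variant of removeNullColumns: alias each count to its column name.
def dropAllNullColumns(df: DataFrame): DataFrame = {
  // count(col) ignores nulls, so a result of 0 means the column is entirely null.
  val aggs = df.columns.map(c => count(col(c)).alias(c))
  val counts = df.agg(aggs.head, aggs.tail: _*).first()
  val allNull = df.columns.filter(c => counts.getAs[Long](c) == 0L)
  // drop(String) one column at a time, which also exists in Spark 1.6.
  allNull.foldLeft(df)((acc, c) => acc.drop(c))
}

// Local session for the demo only (assumption: Spark 2.x+).
val spark = SparkSession.builder().master("local[1]").appName("drop-null-columns").getOrCreate()

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("note", StringType, nullable = true), // null in every row
  StructField("name", StringType, nullable = true)  // null in some rows only
))
val rows = Seq(Row(1, null, "a"), Row(2, null, null))
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// "note" is dropped; "id" and "name" remain.
val cleaned = dropAllNullColumns(df)
cleaned.printSchema()
```

Like the original, this runs a single aggregation job no matter how many columns there are.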


旧时难觅i

2021-01-12 14:48
I had the same problem and came up with a similar solution in Java. In my opinion, there is no other way of doing it at the moment.
```
for (String column:df.columns()){
    long count = df.select(column).distinct().count();

    if(count == 1 && df.select(column).first().isNullAt(0)){
        df = df.drop(column);
    }
}
```
I'm dropping all columns that contain exactly one distinct value and whose first value is null. This way I can be sure that I don't drop columns where all values are the same but not null.
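
Since the question is about Scala, here is a rough sketch of the same per-column idea in Scala. It is a variant, not a translation: instead of `distinct().count()` it asks whether any non-null value exists at all, which lets Spark stop at the first hit. The helper names `isAllNull` and `dropAllNullColumnsPerColumn` are made up for this example, and plain column names without dots are assumed.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// True when the column holds no non-null value (also true for an empty DataFrame).
def isAllNull(df: DataFrame, column: String): Boolean =
  df.filter(col(column).isNotNull).head(1).isEmpty

// Checks the columns one by one and drops those that are entirely null.
def dropAllNullColumnsPerColumn(df: DataFrame): DataFrame =
  df.columns.filter(c => isAllNull(df, c)).foldLeft(df)((acc, c) => acc.drop(c))
```

This still launches one job per column, so on wide tables a single aggregation over all columns is likely to be cheaper.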

走了就别回头了

2021-01-12 14:48
If the DataFrame is of reasonable size, I write it out as JSON and then reload it. Null fields are left out of the written records, so schema inference on reload never sees the all-null columns and you end up with a lighter DataFrame.

Scala snippet:
```
originalDataFrame.write.json(tempJsonPath)
val lightDataFrame = spark.read.json(tempJsonPath)
```
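
One thing to watch with the round trip is that the reloaded schema is inferred from the JSON text, so column types and order may differ from the original (integer columns typically come back as long, for instance). A possible workaround, sketched below, is to use the reloaded DataFrame only to learn which columns survived and then select those from the original. The function name `dropNullColumnsViaJson` and the overwrite mode are assumptions for this example; it also assumes a non-empty DataFrame and plain top-level column names without dots.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper: JSON round trip used only to discover the non-null columns.
def dropNullColumnsViaJson(spark: SparkSession, df: DataFrame, tempJsonPath: String): DataFrame = {
  // Null fields are omitted from the written records (the behaviour the answer relies on).
  df.write.mode("overwrite").json(tempJsonPath)
  // Columns that appear in the inferred schema had at least one non-null value.
  val survivors = spark.read.json(tempJsonPath).columns.toSet
  // Select from the original DataFrame so column order and types stay unchanged.
  df.select(df.columns.filter(survivors.contains).map(col): _*)
}
```

This writes and re-reads the whole dataset, so it only makes sense under the "reasonable size" condition mentioned above.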