In a recent SO post, I discovered that using withColumn may improve the DAG when dealing with stacked/chained column expressions. But does that still hold when combining nested withColumn calls with window functions?
Let's say I want to do:
w1 = ...rangeBetween(-300, 0)
w2 = ...rowsBetween(-1,0)
(df.withColumn("some1", f.max("original1").over(w1))
   .withColumn("some2", f.lag("some1").over(w2))
   .show())
I get a lot of memory problems and heavy spill even with very small datasets. If I do the same using select instead of withColumn, it performs way faster.
df.select(
    f.max(col("original1")).over(w1).alias("some1"),
    f.lag("some1").over(w2).alias("some2")
).show()
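
One way to see where the difference comes from is to compare the query plans of the two variants. Below is a minimal, self-contained sketch for doing that; the DataFrame, the key/ts columns, and the concrete window specs are made up for illustration. Note that I leave the explicit frame off w2 (lag brings its own single-row frame) and split the select variant into two selects, since an alias usually can't be referenced in the same select that defines it.

from pyspark.sql import SparkSession, functions as f
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# toy data: (key, ts, original1)
df = spark.createDataFrame(
    [(1, 10, 100), (1, 20, 200), (2, 30, 300)],
    ["key", "ts", "original1"],
)

w1 = Window.partitionBy("key").orderBy("ts").rangeBetween(-300, 0)
w2 = Window.partitionBy("key").orderBy("ts")  # no frame: lag defines its own

# chained withColumn version
(df.withColumn("some1", f.max("original1").over(w1))
   .withColumn("some2", f.lag("some1").over(w2))
   .explain(True))

# select version: one projection per select instead of one per withColumn
(df.select("*", f.max("original1").over(w1).alias("some1"))
   .select("*", f.lag("some1").over(w2).alias("some2"))
   .explain(True))

In the analyzed plan of the first variant you should see an extra Project node per withColumn call, which is the internal projection mentioned below.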
This looks like a consequence of the internal projection caused by withColumn; it's documented in the Spark docs.
The official recommendation is to do as Jay recommended and instead use a single select when dealing with multiple columns.
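
For completeness, here is a hedged sketch of the pattern that recommendation points to, reusing df, w1 and w2 from the sketch above (the column names are invented): when the new columns don't depend on each other, build the expressions first and add them all in a single select rather than one withColumn, and hence one internal projection, per column.

from pyspark.sql import functions as f

new_cols = [
    f.max("original1").over(w1).alias("running_max"),
    f.lag("original1").over(w2).alias("previous_value"),
    (f.col("original1") * 2).alias("doubled"),
]

# instead of looping df = df.withColumn(name, expr) for each new column
result = df.select("*", *new_cols)
result.show()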