Spark DAG differs with 'withColumn' vs 'select'

Asked 2021-01-05 17:16

Context

In a recent SO post, I discovered that using withColumn may improve the DAG when dealing with stacked/chained column expressions in conjunction with distinct window specifications.

2 Answers
  • 2021-01-05 17:40

    Does the same hold when using nested withColumn calls together with window functions?

    Let's say I want to do:

    w1 = ...rangeBetween(-300, 0)
    w2 = ...rowsBetween(-1,0)
    
    (df.withColumn("some1", f.max("original1").over(w1))
       .withColumn("some2", f.lag("some1").over(w2))
       .show())
    
    

    I got a lot of memory problems and high spill even with very small datasets. If I do the same using select instead of withColumn it performs way faster.

    df.select(
        f.max(col("original1")).over(w1).alias("some1"),
        f.lag("some1").over(w2).alias("some2")
    ).show()
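
    Putting the two variants side by side as a runnable sketch (the sample data, the "ts" column, and the concrete window definitions are my assumptions, since the original post elides them): note that inside a single select, "some1" is not yet a column of df, so the lag expression has to repeat the max-over-w1 expression.

    ```python
    from pyspark.sql import SparkSession, Window, functions as f

    spark = SparkSession.builder.master("local[1]").getOrCreate()

    # Hypothetical sample data; "ts" and the values are illustrative.
    df = spark.createDataFrame([(1, 10.0), (2, 5.0), (3, 30.0)],
                               ["ts", "original1"])

    w1 = Window.orderBy("ts").rangeBetween(-300, 0)
    w2 = Window.orderBy("ts")  # lag() supplies its own one-row frame

    # Variant 1: chained withColumn, one internal projection per call.
    chained = (df.withColumn("some1", f.max("original1").over(w1))
                 .withColumn("some2", f.lag("some1").over(w2)))

    # Variant 2: a single select. "some1" does not exist on df yet, so
    # the lag expression repeats the max-over-w1 expression.
    selected = df.select(
        "ts",
        f.max("original1").over(w1).alias("some1"),
        f.lag(f.max("original1").over(w1)).over(w2).alias("some2"),
    )

    chained.explain()   # compare the two physical plans
    selected.explain()
    ```

    Both variants compute the same result; the difference shows up only in the query plans.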
    
  • 2021-01-05 17:52

    This looks like a consequence of the internal projection caused by withColumn. It's documented in the Spark docs.

    The official recommendation is to do as Jay suggested and instead use a single select when dealing with multiple columns.
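
    The warning in the docs is easiest to see in the loop case; a minimal sketch, with made-up column names and loop bound:

    ```python
    from pyspark.sql import SparkSession, functions as f

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.range(5)

    # Anti-pattern: each withColumn call layers another internal
    # projection onto the logical plan.
    looped = df
    for i in range(20):
        looped = looped.withColumn(f"c{i}", f.lit(i))

    # Recommended: add all columns in one projection via a single select.
    selected = df.select("id", *[f.lit(i).alias(f"c{i}") for i in range(20)])

    looped.explain(True)    # parsed plan shows one nested Project per call
    selected.explain(True)  # a single Project
    ```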
