In a recent SO post, I discovered that using withColumn may improve the DAG when dealing with stacked/chained column expressions. But does that still hold when combining nested withColumn calls with window functions?
Let's say I want to do:
w1 = ...rangeBetween(-300, 0)
w2 = ...rowsBetween(-1,0)
(df.withColumn("some1", f.max("original1").over(w1))
   .withColumn("some2", f.lag("some1").over(w2))
   .show())
I get a lot of memory problems and heavy spill even with very small datasets. If I do the same using select instead of withColumn, it performs way faster.
df.select(
    f.max(col("original1")).over(w1).alias("some1"),
    f.lag("some1").over(w2).alias("some2")
).show()
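
One way to see where the difference comes from is to compare the query plans of the two variants. Below is a minimal, self-contained sketch for doing that; the DataFrame, the key/ts columns, and the concrete window specs are made up for illustration. Note that I leave the explicit frame off w2 (lag brings its own single-row frame) and split the select variant into two selects, since an alias usually can't be referenced in the same select that defines it.

from pyspark.sql import SparkSession, functions as f
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# toy data: (key, ts, original1)
df = spark.createDataFrame(
    [(1, 10, 100), (1, 20, 200), (2, 30, 300)],
    ["key", "ts", "original1"],
)

w1 = Window.partitionBy("key").orderBy("ts").rangeBetween(-300, 0)
w2 = Window.partitionBy("key").orderBy("ts")  # no frame: lag defines its own

# chained withColumn version
(df.withColumn("some1", f.max("original1").over(w1))
   .withColumn("some2", f.lag("some1").over(w2))
   .explain(True))

# select version: one projection per select instead of one per withColumn
(df.select("*", f.max("original1").over(w1).alias("some1"))
   .select("*", f.lag("some1").over(w2).alias("some2"))
   .explain(True))

In the analyzed plan of the first variant you should see an extra Project node per withColumn call, which is the internal projection mentioned below.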
This looks like a consequence of the internal projection caused by withColumn; it's documented in the Spark docs.
The official recommendation is to do as Jay recommended and instead use a single select when dealing with multiple columns.
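
For completeness, here is a hedged sketch of the pattern that recommendation points to, reusing df, w1 and w2 from the sketch above (the column names are invented): when the new columns don't depend on each other, build the expressions first and add them all in a single select rather than one withColumn, and hence one internal projection, per column.

from pyspark.sql import functions as f

new_cols = [
    f.max("original1").over(w1).alias("running_max"),
    f.lag("original1").over(w2).alias("previous_value"),
    (f.col("original1") * 2).alias("doubled"),
]

# instead of looping df = df.withColumn(name, expr) for each new column
result = df.select("*", *new_cols)
result.show()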