When decreasing the number of partitions one can use coalesce
, which is great because it doesn\'t cause a shuffle and seems to work instantly (doesn\'t require an a
As you know pyspark use some kind of "lazy" way of running. It will only do the computation when there is some action to do (for exemple a "df.count()" or a "df.show()". So what you can do is define the a shuffle partition between those actions.
You can write :
sparkSession.sqlContext().sql("set spark.sql.shuffle.partitions=100")
# you spark code here with some transformation and at least one action
df = df.withColumn("sum", sum(df.A).over(your_window_function))
df.count() # your action
df = df.filter(df.B <10)
df = df.count()
sparkSession.sqlContext().sql("set spark.sql.shuffle.partitions=10")
# you reduce the number of partition because you know you will have a lot
# less data
df = df.withColumn("max", max(df.A).over(your_other_window_function))
df.count() # your action