Spark: increase number of partitions without causing a shuffle?

夕颜 2021-02-07 03:22

When decreasing the number of partitions one can use coalesce, which is great because it doesn't cause a shuffle and seems to work instantly (doesn't require an a

3 answers

情书的邮戳 (original poster)

2021-02-07 03:47

As you know, PySpark is evaluated lazily: it only does the computation when an action is called (for example "df.count()" or "df.show()"). So what you can do is change the shuffle partition setting between those actions.

You can write:

from pyspark.sql import functions as F

sparkSession.sql("SET spark.sql.shuffle.partitions=100")
# your Spark code here, with some transformations and at least one action
df = df.withColumn("sum", F.sum(df.A).over(your_window_function))
df.count()  # your action

df = df.filter(df.B < 10)
df.count()  # an action; do not reassign df here, count() returns an int

sparkSession.sql("SET spark.sql.shuffle.partitions=10")
# you reduce the number of partitions because you know you will have a lot
# less data
df = df.withColumn("max", F.max(df.A).over(your_other_window_function))
df.count()  # your action
