Spark: increase number of partitions without causing a shuffle?


When decreasing the number of partitions one can use coalesce, which is great because it doesn't cause a shuffle and seems to work instantly (doesn't require an a

3 Answers

走了就别回头了

2021-02-07 03:36
I don't quite understand your point. Do you mean that you currently have 5 partitions, but want the data distributed across 10 after the next operation? Having 10 partitions while still using only 5 does not make much sense… The data has to be sent to the new partitions at some point.

When doing coalesce, you can get rid of unused partitions. For example, if you initially had 100 partitions but only 10 of them hold data after reduceByKey (because there were only 10 keys), you can call coalesce to shrink down to those 10.
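
To make the asymmetry concrete, here is a small plain-Python model. It is not Spark code: the functions `coalesce` and `repartition` below are hypothetical stand-ins that work on lists of lists, and the round-robin routing is a simplifying assumption. The point it sketches is that merging whole partitions never moves individual records apart, whereas getting more partitions requires routing each record separately, which is what a shuffle does.

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], target: int) -> List[Partition]:
    """Model of a shuffle-free coalesce: merge whole partitions into `target` groups."""
    if target >= len(partitions):
        # Merging can only reduce the count, so asking for more changes nothing.
        return [list(p) for p in partitions]
    merged: List[Partition] = [[] for _ in range(target)]
    for index, partition in enumerate(partitions):
        # Each input partition is appended whole to exactly one output partition.
        merged[index * target // len(partitions)].extend(partition)
    return merged


def repartition(partitions: List[Partition], target: int) -> List[Partition]:
    """Model of a shuffle: every record is routed individually (round-robin here)."""
    shuffled: List[Partition] = [[] for _ in range(target)]
    position = 0
    for partition in partitions:
        for record in partition:
            shuffled[position % target].append(record)
            position += 1
    return shuffled


five = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# Roughly what rdd.coalesce(2) does: partitions are merged, records stay together.
two = coalesce(five, 2)

# Roughly what rdd.coalesce(10) does without a shuffle: the count stays at 5.
still_five = coalesce(five, 10)

# Roughly what rdd.repartition(10) does: records are redistributed one by one.
ten = repartition(five, 10)
```

In this model `two` holds two merged partitions, `still_five` still has five, and only `ten` actually reaches ten partitions, at the cost of touching every record.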

If you want the process to go the other way, you could just force some kind of partitioning:
```
[RDD].partitionBy(new HashPartitioner(100))
```
I'm not sure that's what you're looking for, but I hope so.
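
For reference, partitionBy is defined on key-value RDDs, and HashPartitioner picks the target partition from the key's hash code modulo the number of partitions, so this approach does move the data through a shuffle. A rough plain-Python model of that assignment follows; `hash_partition` is a hypothetical helper, and Python's built-in `hash` stands in for the JVM `hashCode`, so the exact placement differs from what Spark would produce.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

Pair = Tuple[Hashable, str]


def hash_partition(pairs: Iterable[Pair], num_partitions: int) -> Dict[int, List[Pair]]:
    """Assign each (key, value) pair to partition hash(key) mod num_partitions."""
    buckets: Dict[int, List[Pair]] = defaultdict(list)
    for key, value in pairs:
        # Python's % with a positive modulus is never negative,
        # so the result is always a valid partition index.
        buckets[hash(key) % num_partitions].append((key, value))
    return dict(buckets)


# Integer keys keep the example deterministic (small ints hash to themselves).
pairs = [(k, f"v{k}") for k in range(10)]
by_key = hash_partition(pairs, 4)

for index in sorted(by_key):
    print(index, [k for k, _ in by_key[index]])
```

All records with the same key land in the same partition, which is why this kind of partitioning is useful before key-based operations even though it is not free.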
长情又很酷

2021-02-07 03:38

Watch this space

https://issues.apache.org/jira/browse/SPARK-5997

This kind of really simple obvious feature will eventually be implemented - I guess just after they finish all the unnecessary features in Datasets.


情书的邮戳

2021-02-07 03:47

As you know, PySpark evaluates lazily: it only runs the computation when an action is called (for example "df.count()" or "df.show()"). So what you can do is change the number of shuffle partitions between those actions.

You can write:

```
from pyspark.sql import functions as F

sparkSession.sql("set spark.sql.shuffle.partitions=100")
# your Spark code here, with some transformations and at least one action
df = df.withColumn("sum", F.sum(df.A).over(your_window_function))
df.count()  # your action

df = df.filter(df.B < 10)
df.count()  # not assigned back to df: count() returns an int, not a DataFrame

sparkSession.sql("set spark.sql.shuffle.partitions=10")
# reduce the number of shuffle partitions because you know you will have
# a lot less data
df = df.withColumn("max", F.max(df.A).over(your_other_window_function))
df.count()  # your action
```
