Spark: increase number of partitions without causing a shuffle?

Backend · 3 answers · 1504 views
夕颜 · 2021-02-07 03:22

When decreasing the number of partitions one can use coalesce, which is great because it doesn't cause a shuffle and seems to work instantly (doesn't require an additional job step). Is there a way to do the opposite, i.e. increase the number of partitions, without causing a shuffle?
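
For reference, a minimal sketch of the asymmetry the question is about (rdd is a placeholder name for any existing RDD):

    val fewer = rdd.coalesce(5)      // narrow dependency: merges partitions, no shuffle
    val more  = rdd.repartition(20)  // always shuffles; same as coalesce(20, shuffle = true)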

3 Answers
  •  走了就别回头了
    2021-02-07 03:36

    I don't exactly understand your point. Do you mean you have 5 partitions now, but after the next operation you want the data distributed across 10? Because having 10 but still only using 5 doesn't make much sense… the process of sending data to the new partitions has to happen at some point.

    With coalesce you can get rid of unused partitions. For example: if you started with 100 partitions, but after reduceByKey only 10 of them hold data (since there were only 10 distinct keys), you can coalesce down to 10 (see the sketch below).
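
    A sketch of that scenario, assuming rdd is an RDD[String] with 100 partitions and only 10 distinct values:

    val pairs   = rdd.map(word => (word, 1))   // 100 partitions
    val reduced = pairs.reduceByKey(_ + _)     // keeps 100 partitions; at most 10 are non-empty
    val compact = reduced.coalesce(10)         // merge them down to 10 without a shuffle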

    If you want to go the other way and spread the data across more partitions, you can force an explicit partitioning:

    import org.apache.spark.HashPartitioner
    pairRDD.partitionBy(new HashPartitioner(100))  // pairRDD must be an RDD[(K, V)]
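
    To make that runnable end to end, a self-contained sketch (assumes an existing SparkContext named sc; note that partitionBy is only defined on pair RDDs, and moving data into the 100 new partitions is itself a shuffle, as mentioned above):

    val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), 5)  // 5 partitions
    val spread = pairs.partitionBy(new HashPartitioner(100))           // hash keys across 100 partitions
    println(spread.getNumPartitions)                                   // 100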
    

    I'm not sure that's what you're looking for, but I hope it helps.
