Spark - repartition() vs coalesce()

前端 未结 14 1786
误落风尘
误落风尘 2020-11-22 17:11

According to Learning Spark

Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of

14条回答
  •  感情败类
    2020-11-22 17:19

    What follows from the code and code docs is that coalesce(n) is the same as coalesce(n, shuffle = false) and repartition(n) is the same as coalesce(n, shuffle = true)

    Thus, both coalesce and repartition can be used to increase number of partitions

    With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large.

    Another important note to accentuate is that if you drastically decrease number of partitions you should consider using shuffled version of coalesce (same as repartition in that case). This will allow your computations be performed in parallel on parent partitions (multiple task).

    However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).

    Please also refer to the related answer here

提交回复
热议问题