According to Learning Spark

> Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of `repartition()` called `coalesce()` that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
What follows from the code and the code docs is that `coalesce(n)` is the same as `coalesce(n, shuffle = false)` and `repartition(n)` is the same as `coalesce(n, shuffle = true)`. Thus, both `coalesce` and `repartition` can be used to increase the number of partitions.
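To make the equivalence concrete, here is a minimal sketch, runnable in spark-shell (where `sc` is the predefined SparkContext); the toy data and variable names are illustrative:

```scala
// A 100-partition RDD of toy data.
val rdd = sc.parallelize(1 to 1000, numSlices = 100)

// repartition(n) is implemented as coalesce(n, shuffle = true),
// so these two calls are equivalent:
val a = rdd.repartition(10)
val b = rdd.coalesce(10, shuffle = true)

// coalesce(n) defaults to shuffle = false: a narrow dependency that merges
// existing partitions without a full shuffle.
val c = rdd.coalesce(10) // same as rdd.coalesce(10, shuffle = false)
```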
With `shuffle = true`, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large.
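For example, a sketch of growing the partition count (again assuming spark-shell's `sc`; the sizes are made up):

```scala
val skewed = sc.parallelize(1 to 100000, numSlices = 100)

// Without a shuffle, coalesce cannot increase the partition count;
// the request is ignored and the RDD keeps its 100 partitions.
skewed.coalesce(300).getNumPartitions                 // 100

// With shuffle = true the records are redistributed, so the count grows.
skewed.coalesce(300, shuffle = true).getNumPartitions // 300
```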
Another important note: if you drastically decrease the number of partitions, you should consider using the shuffled version of `coalesce` (the same as `repartition` in that case). This allows your computations to be performed in parallel on the parent partitions (multiple tasks).
However, if you're doing a drastic coalesce, e.g. to `numPartitions = 1`, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of `numPartitions = 1`). To avoid this, you can pass `shuffle = true`. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
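A short sketch of that difference for a drastic coalesce down to one partition (spark-shell again; `expensive` is a hypothetical stand-in for real per-record work):

```scala
def expensive(x: Int): Int = { Thread.sleep(1); x * 2 } // placeholder workload

val big = sc.parallelize(1 to 1000000, numSlices = 1000)

// Narrow coalesce: the map is pipelined into the single output partition,
// so all of the work runs in one task on one executor.
val oneTask = big.map(expensive).coalesce(1)

// Shuffled coalesce (same as repartition(1)): the map still runs as 1000
// parallel tasks; only the final shuffle funnels data into one partition.
val parallelUpstream = big.map(expensive).coalesce(1, shuffle = true)
```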
Please also refer to the related answer here.