According to Learning Spark
Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
One additional point to note here: since the basic principle of Spark RDDs is immutability, repartition and coalesce each return a new RDD. The base RDD continues to exist with its original number of partitions. So if the use case demands persisting an RDD in cache, the same must be done for the newly created RDD.
scala> pairMrkt.repartition(10)
res16: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[11] at repartition at <console>:26
scala> res16.partitions.length
res17: Int = 10
scala> pairMrkt.partitions.length
res20: Int = 2
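To illustrate the caching point, here is a minimal spark-shell sketch (the data and names are purely illustrative):

import org.apache.spark.storage.StorageLevel

val base = sc.parallelize(1 to 100, 2)           // toy data with 2 partitions
val repartitioned = base.repartition(10)         // a brand-new RDD with 10 partitions
repartitioned.persist(StorageLevel.MEMORY_ONLY)  // caching base would not cache this one
base.partitions.length                           // 2  -- the base RDD is unchanged
repartitioned.partitions.length                  // 10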
Put simply, COALESCE:- only decreases the number of partitions; no shuffling of data takes place, it just merges existing partitions.
REPARTITION:- can both increase and decrease the number of partitions, but shuffling takes place.
Example:-
val rdd = sc.textFile("path", 7)   // "path" is a placeholder; asks for at least 7 partitions
val rdd10 = rdd.repartition(10)    // increase to 10 partitions (full shuffle)
val rdd2  = rdd.repartition(2)     // decrease to 2 partitions (repartition still shuffles)
Both work fine.
Generally, we reach for these two operations when we need to consolidate the output into fewer partitions, e.g. to see it in one place.
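For instance, a quick spark-shell check (toy data) shows coalesce merging partitions downwards but silently refusing to grow them:

val rdd7 = sc.parallelize(1 to 70, 7)   // 7 partitions
rdd7.coalesce(2).partitions.length      // 2  -- partitions merged, no shuffle
rdd7.coalesce(10).partitions.length     // 7  -- coalesce alone cannot increase
rdd7.repartition(10).partitions.length  // 10 -- repartition shuffles and grows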
What follows from the code and the code docs is that coalesce(n) is the same as coalesce(n, shuffle = false) and repartition(n) is the same as coalesce(n, shuffle = true). Thus, both coalesce and repartition can be used to increase the number of partitions.
With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large.
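For example, a small spark-shell sketch (the partition counts are made up):

val small = sc.parallelize(1 to 1000, 100)              // 100 partitions
small.coalesce(1000).partitions.length                  // 100  -- default shuffle = false cannot grow
small.coalesce(1000, shuffle = true).partitions.length  // 1000 -- equivalent to repartition(1000)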
Another important point to accentuate: if you drastically decrease the number of partitions, you should consider using the shuffled version of coalesce (same as repartition in that case). This will allow your computations to be performed in parallel on the parent partitions (multiple tasks).
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
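As a sketch of that trade-off (spark-shell, illustrative data):

val bigRdd = sc.parallelize(1 to 1000000, 200)
// Narrow coalesce: the map and the coalesce fuse into one stage with a single task.
val narrow = bigRdd.map(_ * 2).coalesce(1)
// Shuffled coalesce (same as repartition(1)): the map still runs as 200 parallel
// tasks, and only the added shuffle step funnels everything into one partition.
val parallel = bigRdd.map(_ * 2).coalesce(1, shuffle = true)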
Please also refer to the related answer here
All the answers add some great knowledge to this very frequently asked question. So, going by the tradition of this question's timeline, here are my 2 cents.
I found repartition to be faster than coalesce in one very specific case.
In my application, when the number of files we estimate is lower than a certain threshold, repartition works faster.
Here is what I mean:
import org.apache.spark.sql.SaveMode

if (numFiles > 20)
  df.coalesce(numFiles).write.mode(SaveMode.Overwrite).parquet(dest)    // many files: cheap merge, no full shuffle
else
  df.repartition(numFiles).write.mode(SaveMode.Overwrite).parquet(dest) // few files: full shuffle, but faster here
In the above snippet, if my files were fewer than 20, coalesce was taking forever to finish while repartition was much faster, hence the above code.
Of course, this number (20) will depend on the number of workers and amount of data.
Hope that helps.
The repartition algorithm does a full shuffle of the data and creates equal-sized partitions of data. coalesce combines existing partitions to avoid a full shuffle.
Coalesce works well for taking an RDD with a lot of partitions and combining partitions on a single worker node to produce a final RDD with fewer partitions.
Repartition will reshuffle the data in your RDD to produce the final number of partitions you request.
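You can see the difference by counting elements per partition, e.g. in spark-shell (toy data; exact sizes will vary):

val rdd = sc.parallelize(1 to 100, 10)
rdd.repartition(4).glom().map(_.length).collect()  // roughly equal, e.g. Array(25, 25, 25, 25)
rdd.coalesce(4).glom().map(_.length).collect()     // merged neighbours; sizes follow the old layout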
The partitioning of DataFrames seems like a low-level implementation detail that should be managed by the framework, but it’s not. When filtering large DataFrames into smaller ones, you should almost always repartition the data.
You’ll probably be filtering large DataFrames into smaller ones frequently, so get used to repartitioning.
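A minimal sketch of that pattern (spark-shell; the data and the target partition count of 8 are assumptions):

val large = spark.range(0, 1000000).toDF("id")  // illustrative wide input
val small = large.filter($"id" % 1000 === 0)    // only ~0.1% of the rows survive
val ready = small.repartition(8)                // compact the survivors into 8 healthy partitions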
Read this blog post if you'd like even more details.
Another difference worth considering is the situation where you have a skewed join and need to coalesce on top of it. A repartition will resolve the skew in most cases, and then you can do the coalesce.
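A hedged sketch of that ordering (spark-shell; the inputs, column name, and partition counts are hypothetical, and whether a plain repartition cures a given skew depends on the data):

val left  = spark.range(0, 100000).withColumn("join_key", $"id" % 10)  // imagine this side is skewed
val right = spark.range(0, 10).withColumnRenamed("id", "join_key")

val joined  = left.join(right, "join_key")  // result partitions may be badly skewed
val leveled = joined.repartition(100)       // round-robin shuffle evens the sizes out
val output  = leveled.coalesce(10)          // then coalesce cheaply before writing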
Another situation: suppose you have saved a medium-to-large volume of data in a DataFrame and have to produce it to Kafka in batches. A repartition helps before calling collectAsList and producing to Kafka in certain cases. But when the volume is really high, the repartition will likely cause a serious performance impact. In that case, producing to Kafka directly from the DataFrame would help.
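For the direct-from-DataFrame route, a sketch using Spark's batch Kafka sink (the servers, topic, and column names are placeholders):

df.selectExpr("CAST(id AS STRING) AS key", "CAST(payload AS STRING) AS value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("topic", "events")
  .save()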
Side note: coalesce does not avoid data movement in the sense of avoiding all movement of data between workers; it does reduce the amount of shuffling that happens, though. I think that's what the book means.