I have a simple question about a Spark transformation function.
coalesce(numPartitions) - Decrease the number of partitions in the RDD to numPartitions. Useful for runn
The coalesce transformation is used to reduce the number of partitions. Use it when the number of output partitions is less than the number of input partitions. Whether it triggers an RDD shuffle depends on the shuffle flag, which is disabled by default (i.e. false).
If the requested number of partitions is larger than the current number and you call coalesce without shuffle=true, the number of partitions remains unchanged. coalesce doesn't guarantee that empty partitions will be removed. For example, if you have 20 empty partitions and 10 partitions with data, there will still be empty partitions after you call rdd.coalesce(25). If you use coalesce with shuffle set to true, it is equivalent to the repartition method, and the data will be distributed roughly evenly across the partitions.