Spark - repartition() vs coalesce()

误落风尘 2020-11-22 17:11

According to Learning Spark:

Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of

14 answers
  •  南笙 (original poster)
    2020-11-22 17:42

    repartition - recommended when increasing the number of partitions. It performs a full shuffle of all the data, which makes it the more expensive of the two operations.

    coalesce - recommended when reducing the number of partitions. For example, if you have 3 partitions and want to reduce them to 2, coalesce moves the data of the 3rd partition into the remaining partitions, while the data already in partitions 1 and 2 stays in the same container. repartition, on the other hand, shuffles the data in all partitions, so network usage between the executors is high and performance suffers.

    coalesce performs better than repartition when reducing the number of partitions, because it avoids a full shuffle.
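
The difference in data movement can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark code and not how Spark is implemented: partitions are modelled as lists, and the helper names coalesce_toy and repartition_toy are hypothetical. It assumes that coalesce merges whole partitions into the ones that are kept, and that repartition sends every record through a shuffle.

```python
def coalesce_toy(partitions, n):
    """Toy model of coalesce: keep the first n partitions in place and
    merge each remaining partition, as a whole, into one of them."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result = [list(p) for p in partitions[:n]]
    moved = 0
    # If n >= len(partitions) there is nothing to merge, so the
    # partition count stays as it is (no increase without a shuffle).
    for i, extra in enumerate(partitions[n:]):
        result[i % n].extend(extra)
        moved += len(extra)  # only records of dropped partitions move
    return result, moved


def repartition_toy(partitions, n):
    """Toy model of repartition: every record is redistributed
    round-robin, so every record counts as shuffled."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result = [[] for _ in range(n)]
    position = 0
    for part in partitions:
        for record in part:
            result[position % n].append(record)
            position += 1
    return result, position  # position == total number of records


data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # 3 partitions, 9 records

coalesced, moved_c = coalesce_toy(data, 2)
print(coalesced, moved_c)      # [[1, 2, 3, 7, 8, 9], [4, 5, 6]] 3

shuffled, moved_r = repartition_toy(data, 2)
print(shuffled, moved_r)       # [[1, 3, 5, 7, 9], [2, 4, 6, 8]] 9
```

In this model coalesce moves 3 of the 9 records, while repartition moves all 9. The output also shows the trade-off: the coalesced partitions are uneven in size (6 and 3 records), whereas the repartitioned ones are close to balanced (5 and 4).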
