Spark - repartition() vs coalesce()

前端 未结 14 1799
误落风尘
误落风尘 2020-11-22 17:11

According to Learning Spark

Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of

14条回答
  •  忘了有多久
    2020-11-22 17:30

    Also another difference is taking into consideration a situation where there is a skew join and you have to coalesce on top of it. A repartition will solve the skew join in most cases, then you can do the coalesce.

    Another situation is, suppose you have saved a medium/large volume of data in a data frame and you have to produce to Kafka in batches. A repartition helps to collectasList before producing to Kafka in certain cases. But, when the volume is really high, the repartition will likely cause serious performance impact. In that case, producing to Kafka directly from dataframe would help.

    side notes: Coalesce does not avoid data movement as in full data movement between workers. It does reduce the number of shuffles happening though. I think that's what the book means.

提交回复
热议问题