Spark: Difference between Shuffle Write, Shuffle spill (memory), Shuffle spill (disk)?

前端 未结 3 599
暖寄归人
暖寄归人 2021-02-01 17:29

I have the following spark job, trying to keep everything in memory:

val myOutRDD = myInRDD.flatMap { fp =>
  val tuple2List: ListBuffer[(String, myClass)] =          


        
3条回答
  •  被撕碎了的回忆
    2021-02-01 18:09

    One more note on how to prevent shuffle spill, since I think that is the most important part of the question from a performance aspect (shuffle write, as mentioned above, is a required part of shuffling).

    Spilling occurs when the at shuffle read, any reducer cannot fit all of the records assigned to it in memory in the shuffle space on that executor. If your shuffle is unbalanced (e.g. some output partitions are much larger than some input partitions), you may have shuffle spill even if the partitions "fit in memory" before the shuffle. The best way to control this is by A) balancing the shuffle... e.g changing your code to reduce before shuffling or by shuffling on different keys or B) changing the shuffle memory settings as suggested above Given the extent of the spill to disk you probably need to do A rather than B.

提交回复
热议问题