What is shuffle read & shuffle write in Apache Spark

心在旅途 2021-02-03 23:07

In the screenshot below of the Spark admin UI running on port 8080:

[Screenshot of the Spark admin UI]

The "Shuffle R

2 answers
  •  慢半拍i (original poster)
    2021-02-03 23:27

    Shuffling is the redistribution of data between Spark stages. "Shuffle Write" is the total serialized data written on all executors before it is transmitted (normally at the end of a stage), and "Shuffle Read" is the total serialized data read on all executors at the beginning of a stage.
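
To make the two metrics concrete, here is a minimal plain-Python simulation of a shuffle. It is not Spark code and does not use any Spark API: the function names `shuffle_write` and `shuffle_read`, the CRC32-based partitioner, and the use of `pickle` for serialization are all assumptions chosen for illustration. The map side serializes its records into one block per reducer, and the reduce side fetches and deserializes the blocks addressed to it.

```python
import pickle
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    # Hypothetical partitioner: a deterministic hash of the key picks the reducer.
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def shuffle_write(map_partitions, num_reducers):
    """Map side: each map task serializes its records into one block per reducer."""
    blocks = {}  # (map_id, reduce_id) -> serialized bytes
    for map_id, records in enumerate(map_partitions):
        buckets = defaultdict(list)
        for key, value in records:
            buckets[partition_for(key, num_reducers)].append((key, value))
        for reduce_id, bucket in buckets.items():
            blocks[(map_id, reduce_id)] = pickle.dumps(bucket)
    return blocks


def shuffle_read(blocks, reduce_id):
    """Reduce side: fetch and deserialize every block addressed to this reducer."""
    records, bytes_read = [], 0
    for (_, target), payload in blocks.items():
        if target == reduce_id:
            bytes_read += len(payload)
            records.extend(pickle.loads(payload))
    return records, bytes_read


# Two map-side partitions, two reducers (illustrative data only).
map_partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
num_reducers = 2

blocks = shuffle_write(map_partitions, num_reducers)
total_written = sum(len(payload) for payload in blocks.values())
total_read = sum(shuffle_read(blocks, r)[1] for r in range(num_reducers))

print("shuffle write bytes:", total_written)
print("shuffle read bytes: ", total_read)
```

In this sketch the bytes written at the end of the map stage equal the bytes read at the start of the reduce stage, and all records with the same key end up at the same reducer.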

    Your program has only one stage, triggered by the "collect" operation. No shuffling is required, because it consists only of consecutive map operations, which are pipelined into a single stage.
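
As a rough illustration of why a chain of maps needs no shuffle, the following plain-Python sketch (again not Spark code; `run_stage` and `collect` are hypothetical helpers named for this example) applies every map function to each partition independently. No record ever has to move to another partition, so there is nothing to write or read between stages.

```python
def run_stage(partitions, functions):
    """Apply a chain of map functions to each partition independently."""
    results = []
    for partition in partitions:
        out = iter(partition)
        for fn in functions:
            # Lazily chain the maps: each record flows through all functions
            # without an intermediate data set being materialized.
            out = map(fn, out)
        results.append(list(out))
    return results


def collect(partition_results):
    """Gather the per-partition results into one list, like a final action."""
    return [item for part in partition_results for item in part]


# Illustrative input: two partitions and two consecutive map steps.
partitions = [[1, 2], [3, 4]]
functions = [lambda x: x + 1, lambda x: x * 10]

result = collect(run_stage(partitions, functions))
print(result)
```

Each output partition depends only on the matching input partition, which is the property that lets such operations be pipelined into one stage.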

    These slides may also help: http://de.slideshare.net/colorant/spark-shuffle-introduction

    It could also help to read chapter 5 of the original paper: http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
