What is shuffle read & shuffle write in Apache Spark

心在旅途 2021-02-03 23:07

In the screenshot below of the Spark admin UI running on port 8080:

[screenshot]

The "Shuffle R

2 Answers
  • 2021-02-03 23:24

    I believe you have to run your application in cluster/distributed mode to see any Shuffle Read or Shuffle Write values. Shuffles are typically triggered by a subset of Spark operations, namely those that regroup data by key (e.g., groupBy, join).

  • 2021-02-03 23:27

    Shuffling means the redistribution of data between Spark stages. "Shuffle Write" is the total amount of serialized data written on all executors before transmission (normally at the end of a stage), and "Shuffle Read" is the total amount of serialized data read on all executors at the beginning of a stage.
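
    To make the two counters concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's implementation: the names `shuffle_write` and `shuffle_read`, the hash-based bucketing, and the use of `pickle` for serialization are all assumptions made for illustration. The map side serializes each record into a bucket per reducer and counts the bytes ("shuffle write"); the reduce side fetches its buckets, counts the bytes it reads ("shuffle read"), and aggregates by key.

```python
import pickle
from collections import defaultdict


def shuffle_write(map_partitions, num_reducers):
    """Map side (toy model): serialize each record into a per-reducer bucket."""
    buckets = defaultdict(list)  # (map_id, reduce_id) -> list of serialized records
    bytes_written = 0
    for map_id, records in enumerate(map_partitions):
        for key, value in records:
            reduce_id = hash(key) % num_reducers  # assumed hash partitioning
            blob = pickle.dumps((key, value))
            buckets[(map_id, reduce_id)].append(blob)
            bytes_written += len(blob)
    return buckets, bytes_written


def shuffle_read(buckets, num_maps, num_reducers):
    """Reduce side (toy model): each reducer reads its bucket from every map task."""
    reduce_partitions = []
    bytes_read = 0
    for reduce_id in range(num_reducers):
        totals = defaultdict(int)
        for map_id in range(num_maps):
            for blob in buckets.get((map_id, reduce_id), []):
                bytes_read += len(blob)
                key, value = pickle.loads(blob)
                totals[key] += value
        reduce_partitions.append(dict(totals))
    return reduce_partitions, bytes_read


# Two map-side partitions of (word, count) pairs, shuffled to two reducers.
map_partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("c", 1)],
]
buckets, bytes_written = shuffle_write(map_partitions, num_reducers=2)
reduce_partitions, bytes_read = shuffle_read(buckets, num_maps=2, num_reducers=2)

# Every key lands in exactly one reducer, so the per-reducer results can be merged.
merged = {}
for part in reduce_partitions:
    merged.update(part)
print(merged, bytes_written, bytes_read)
```

    In this toy model the two byte totals are equal because everything written is read back exactly once; the point is only that one stage produces the written data and the next stage consumes it.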

    Your program has only one stage, triggered by the "collect" operation. No shuffling is required, because it consists only of consecutive map operations, which are pipelined into one stage.
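
    The pipelining of consecutive map operations can be sketched in plain Python as well. This is only an illustration of the idea, not Spark code: `run_pipelined_stage` is a hypothetical helper, and the final flattening merely stands in for what "collect" does. Each partition is processed independently and lazily, so no record ever has to move to another partition and nothing needs to be shuffled.

```python
def run_pipelined_stage(partitions, functions):
    """Apply a chain of map functions to each partition independently (toy model)."""
    results = []
    for records in partitions:
        stream = iter(records)
        for fn in functions:
            # Lazy chaining: no intermediate collection is built between steps.
            stream = map(fn, stream)
        results.append(list(stream))
    return results


partitions = [[1, 2, 3], [4, 5]]
functions = [lambda x: x + 1, lambda x: x * 10]

# Stand-in for "collect": gather the results of all partitions into one list.
collected = [x for part in run_pipelined_stage(partitions, functions) for x in part]
print(collected)  # [20, 30, 40, 50, 60]
```

    A key-based regrouping step (as in groupBy or join) could not be expressed this way, because it needs records from several partitions at once; that is where a stage boundary and a shuffle would appear.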

    Try to take a look at these slides: http://de.slideshare.net/colorant/spark-shuffle-introduction

    It could also help to read chapter 5 of the original paper: http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
