Shuffle and sort for MapReduce

臣服心动 2021-01-12 06:37

I read through the Definitive Guide and some other links on the web, including the one here.

My question is

where exactly does shuffling and so

1 Answer
  • 2021-01-12 07:04

    Shuffle:

    MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers map outputs to the reducers as inputs is known as the shuffle.
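To make the guarantee above concrete, here is a minimal plain-Python sketch of the idea, not Hadoop code. The names `map_words`, `shuffle`, and `reduce_counts` are hypothetical and exist only for this illustration; a real job runs these steps across many machines.

```python
from collections import defaultdict


def map_words(line):
    """Hypothetical map function: emit (word, 1) for each word in a line."""
    return [(word, 1) for word in line.split()]


def shuffle(map_outputs):
    """Group values by key and hand the groups over in sorted key order."""
    groups = defaultdict(list)
    for key, value in map_outputs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_counts(key, values):
    """Hypothetical reduce function: sum the counts for one key."""
    return key, sum(values)


lines = ["b a c", "a b a"]
map_outputs = [pair for line in lines for pair in map_words(line)]
reducer_input = shuffle(map_outputs)
result = [reduce_counts(key, values) for key, values in reducer_input]

print(reducer_input)  # keys arrive at the reducer in sorted order
print(result)
```

The reducer never sorts anything itself here; it only sees keys that are already grouped and ordered, which is the property the shuffle provides.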

    Sort:

    Sorting happens at several stages of a MapReduce program, so it can occur in both the map and reduce phases.

    Please have a look at this diagram. [Diagram not preserved]

    Here is more detail on the map and reduce phases shown in the image above.

    The Map Side:

    When the map function starts producing output, it is not simply written to disk. The output is first buffered in memory, and when the buffer fills past a threshold, a background thread spills its contents to disk. Before writing to disk, the thread divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the thread performs an in-memory sort by key.
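The partition-then-sort step described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `partition_for` and `spill` are hypothetical names, and `zlib.crc32` stands in for the key hash so the example is deterministic (Hadoop's default partitioner hashes the key and takes it modulo the number of reducers, but with the key's own `hashCode()`).

```python
import zlib


def partition_for(key, num_reducers):
    """Pick a reducer for a key: hash modulo the number of reducers.

    crc32 is an assumption made for this sketch, not Hadoop's hash.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def spill(buffer, num_reducers):
    """Split buffered (key, value) records into partitions, then sort each by key."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in buffer:
        partitions[partition_for(key, num_reducers)].append((key, value))
    for part in partitions:
        part.sort(key=lambda record: record[0])  # in-memory sort by key
    return partitions


buffer = [("pear", 1), ("apple", 1), ("fig", 1), ("apple", 1), ("kiwi", 1)]
spilled = spill(buffer, num_reducers=3)

for index, part in enumerate(spilled):
    print(index, part)
```

Every record with the same key lands in the same partition, and each partition is already sorted when it is written out, so the reduce side only has to merge.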

    The Reduce Side:

    When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. The merge is done in rounds, with each round combining a limited number of sorted outputs at a time.
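A simplified sketch of merging already-sorted map outputs in rounds is shown below. `merge_in_rounds` and its `merge_factor` parameter are hypothetical names for this illustration; Hadoop's real merge scheduling is more involved, so treat this only as a picture of the idea that no sorting from scratch is needed on the reduce side.

```python
import heapq


def merge_in_rounds(runs, merge_factor):
    """Merge sorted runs, at most merge_factor at a time, until one run remains.

    Returns the merged run and the number of rounds it took.
    """
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    rounds = 0
    while len(runs) > 1:
        # Each round merges consecutive groups of runs; heapq.merge keeps
        # the sort order without re-sorting the records.
        runs = [
            list(heapq.merge(*runs[i:i + merge_factor]))
            for i in range(0, len(runs), merge_factor)
        ]
        rounds += 1
    return (runs[0] if runs else []), rounds


# Four sorted map outputs destined for the same reducer.
map_outputs = [
    [("a", 1), ("d", 1)],
    [("b", 1), ("e", 1)],
    [("c", 1)],
    [("a", 1), ("f", 1)],
]
merged, rounds = merge_in_rounds(map_outputs, merge_factor=2)

print(merged)
print(rounds)
```

With four inputs and a merge factor of 2, the first round produces two runs and the second round produces the single sorted run that feeds the reduce function.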

    Source: Hadoop: The Definitive Guide.
