Hadoop - What does a global sort mean and when does it happen in MapReduce?

Submitted by 早过忘川 on 2019-12-13 13:22:39

Question


I am using the Hadoop Streaming JAR for WordCount, and I want to know how to get a globally sorted result. According to an answer to another question on SO, using just one reducer should give a globally sorted output, but my result with numReduceTasks=1 (one reducer) is not sorted.

For example, my input to the mapper is:

file 1: A long time ago in a galaxy far far away

file 2: Another episode for Star Wars

The result is:

A 1

a 1

Star 1

ago 1

for 1

far 2

away 1

time 1

Wars 1

long 1

Another 1

in 1

episode 1

galaxy 1

But this is not a global sort!

So, what does "sort" mean in "shuffle and sort", and what is a global sort?
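For reference, the "sort" in shuffle and sort refers to the framework ordering the mappers' key/value pairs by key before they reach a reducer. The sketch below imitates that step locally, without Hadoop, for the two input lines above. The names `map_words`, `sorted_pairs` and `sorted_keys` are made up for this illustration, and it assumes keys are compared as plain strings, so uppercase letters sort before lowercase ones.

```python
# Local illustration of "map, then sort by key"; no Hadoop is involved.
lines = [
    "A long time ago in a galaxy far far away",  # file 1
    "Another episode for Star Wars",             # file 2
]


def map_words(lines):
    """Emit (word, 1) pairs, the same pairs the question's mapper prints."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


# sorted() compares strings by code point, so "Wars" comes before "a".
sorted_pairs = sorted(map_words(lines))
sorted_keys = [word for word, _ in sorted_pairs]

for word, count in sorted_pairs:
    print('%s\t%s' % (word, count))
```

With a single reducer, one stream ordered like this is all the reducer would see, which is why one reducer is often suggested for a globally sorted result.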

mapper code:

    #!/usr/bin/env python
    import sys

    # Emit one "word<TAB>1" line per word read from standard input.
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print('%s\t%s' % (word, 1))

reducer code:

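    #!/usr/bin/env python
    import sys

    # Running total for each word seen so far.
    word2count = {}

    for line in sys.stdin:
        line = line.strip()
        word, count = line.split('\t', 1)

        # Skip lines whose count is not an integer.
        try:
            count = int(count)
        except ValueError:
            continue

        try:
            word2count[word] = word2count[word] + count
        except KeyError:
            word2count[word] = count

    # Print the totals in whatever order the dictionary iterates its keys.
    for word in word2count.keys():
        print('%s\t%s' % (word, word2count[word]))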
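For comparison, here is a minimal sketch of a reducer that writes words in the order they arrive instead of collecting them in a dictionary first. It assumes that Hadoop Streaming hands the reducer its input lines already sorted by key, so equal words are adjacent. The helper names `parse` and `reduce_sorted` are hypothetical and not part of the original code.

```python
#!/usr/bin/env python
import sys
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs from "word<TAB>count" lines, skipping bad ones."""
    for line in lines:
        word, sep, count = line.strip().partition('\t')
        if not sep:
            continue  # no tab separator on this line
        try:
            yield word, int(count)
        except ValueError:
            continue  # count was not an integer


def reduce_sorted(lines):
    """Sum counts per word, emitting words in the order they arrive.

    groupby only merges adjacent equal keys, so this relies on the
    input being sorted by word (an assumption about the framework).
    """
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


if __name__ == '__main__':
    for word, total in reduce_sorted(sys.stdin):
        print('%s\t%s' % (word, total))
```

Because nothing is stored in a dictionary, the output order is simply the input order, and memory use stays constant regardless of how many distinct words there are.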

I use this command to run it:

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/cloudera/input \
    -output /user/cloudera/output_new_0 \
    -mapper /home/cloudera/wordcount_mapper.py \
    -reducer /home/cloudera/wordcount_reducer.py \
    -numReduceTasks 1

Source: https://stackoverflow.com/questions/40641984/hadoop-globally-sort-mean-and-when-is-happen-in-mapreduce
