Implementing Hadoop MapReduce Programs in Python

1. Overview

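Hadoop Streaming is a utility that makes MapReduce programming easier. With it you can implement the Mapper and Reducer as executables, scripts, or programs in other languages, and still use Hadoop's parallel computing framework to process large data sets. Note that Streaming runs a MapReduce job over Unix standard input and standard output. It differs from Hadoop Pipes mainly in the communication protocol: Pipes is for MapReduce jobs written in C++, which talk to the Hadoop platform over a socket to carry out the job. Any programming language that can read standard input and write standard output can implement a MapReduce job through Streaming. The basic principle is that input arrives on Unix standard input and output is written to Unix standard output.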

2. How Hadoop Streaming Works

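The mapper and reducer read user data from standard input, process it line by line, and write the results to standard output. The Streaming utility creates a MapReduce job, submits it to the TaskTrackers (on a YARN cluster such as Hadoop 2.x, the NodeManagers play this role), and monitors the job while it runs.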

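When a file (an executable or a script) is used as the mapper, each mapper task starts that file as a separate process when the task initializes. While the task runs, it splits its input into lines and feeds each line to the standard input of that process. At the same time, the task collects whatever the process writes to standard output and converts each line into a key/value pair, which becomes the mapper's output. By default, the part of a line before the first tab is the key and the rest (excluding the tab) is the value. If the line contains no tab, the whole line is the key and the value is null.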

The reducer works in a similar way. This is the basic communication protocol between the Map/Reduce framework and a streaming mapper/reducer.
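The default key/value split described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Hadoop's own code: the function name split_key_value is made up here, and None stands in for the "null" value mentioned in the text.

```python
def split_key_value(line, separator="\t"):
    """Split one line of mapper output into (key, value) at the first separator."""
    line = line.rstrip("\n")
    key, found, value = line.partition(separator)
    if not found:
        # No separator at all: the whole line is the key and there is no value.
        return line, None
    return key, value


print(split_key_value("foo\t1"))        # splits into key "foo" and value "1"
print(split_key_value("a\tb\tc"))       # only the first tab splits the line
print(split_key_value("no tab here"))   # the whole line becomes the key
```

Only the first tab matters, so a value may itself contain tabs.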

3. Hadoop Streaming Usage

Usage: $HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar [options]

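Options:

  (1) -input: input path

  (2) -output: output path

  (3) -mapper: the user's own mapper program, either an executable or a script

  (4) -reducer: the user's own reducer program, either an executable or a script

  (5) -file: packages a file into the submitted job. It can be any file the mapper or reducer needs, such as a configuration file or a dictionary. (Hadoop 2.x marks -file as deprecated in favor of the generic -files option, but it still works.)

  (6) -partitioner: a user-defined partitioner (given as a Java class)

  (7) -combiner: a user-defined combiner (older releases required a Java implementation; the Hadoop 2.x documentation also accepts a streaming command)

  (8) -D: job properties (formerly passed with -jobconf), including:
    1) mapred.map.tasks: number of map tasks (named mapreduce.job.maps in Hadoop 2.x)
    2) mapred.reduce.tasks: number of reduce tasks (named mapreduce.job.reduces in Hadoop 2.x)
    3) stream.map.input.field.separator / stream.map.output.field.separator: field separator for map task input/output; both default to \t
    4) stream.num.map.output.key.fields: number of fields that make up the key in each map output record
    5) stream.reduce.input.field.separator / stream.reduce.output.field.separator: field separator for reduce task input/output; both default to \t
    6) stream.num.reduce.output.key.fields: number of fields that make up the key in each reduce output record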
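To show how these options fit together on one command line, here is a small Python sketch that only assembles the argument list and does not run anything. The function build_streaming_command, the /data paths, and the property value are all hypothetical. The one rule it encodes is that generic options such as -D must come before the streaming options.

```python
def build_streaming_command(streaming_jar, input_path, output_path,
                            mapper, reducer=None, properties=None, files=()):
    """Return the argument list for a Hadoop Streaming job (nothing is executed)."""
    cmd = ["hadoop", "jar", streaming_jar]
    # Generic options (-D name=value) have to precede the streaming options.
    for name, value in sorted((properties or {}).items()):
        cmd += ["-D", "%s=%s" % (name, value)]
    cmd += ["-input", input_path, "-output", output_path, "-mapper", mapper]
    if reducer is not None:
        cmd += ["-reducer", reducer]
    # Ship local scripts or side files to the cluster with the job.
    for path in files:
        cmd += ["-file", path]
    return cmd


command = build_streaming_command(
    "hadoop-streaming-2.7.0.jar",            # jar name taken from the usage line
    "/data/in", "/data/out",                 # hypothetical HDFS paths
    "mapper.py", reducer="reducer.py",
    properties={"mapred.reduce.tasks": 2},   # hypothetical setting
    files=["mapper.py", "reducer.py"])
print(" ".join(command))
```

The resulting list could be handed to subprocess.call on a machine where the hadoop command is on the PATH.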

In addition, Hadoop itself ships with some handy Mappers and Reducers.

4. Usage Examples

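The trick to writing MapReduce code in Python is that Hadoop Streaming passes data between Map and Reduce through STDIN (standard input) and STDOUT (standard output). We simply read input from Python's sys.stdin and write output to sys.stdout, and Hadoop Streaming takes care of everything else. Really, that is all there is to it!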

Example

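Save the following code as /usr/local/hadoop/mapper.py. It reads data from STDIN, splits each line into words, and writes one line per word that pairs the word with a count of 1. The counts are not summed at this stage. Note: make sure the script is executable (chmod +x mapper.py).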

#!/usr/bin/env python  
  
import sys  
  
# input comes from STDIN (standard input)  
for line in sys.stdin:  
    # remove leading and trailing whitespace  
    line = line.strip()  
    # split the line into words  
    words = line.split()  
    # emit one (word, 1) pair per word
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

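Save the following code as /usr/local/hadoop/reducer.py. This script reads the output of mapper.py from STDIN, sums the occurrences of each word, and writes the result to STDOUT. Again, make sure the script is executable: chmod +x reducer.py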

#!/usr/bin/env python  
  
import sys
  
current_word = None  
current_count = 0  
word = None  
  
# input comes from STDIN  
for line in sys.stdin:  
    # remove leading and trailing whitespace  
    line = line.strip()  
  
    # parse the input we got from mapper.py  
    word, count = line.split('\t', 1)  
  
    # convert count (currently a string) to int  
    try:  
        count = int(count)  
    except ValueError:  
        # count was not a number, so silently  
        # ignore/discard this line  
        continue  
  
    # this IF-switch only works because Hadoop sorts map output  
    # by key (here: word) before it is passed to the reducer  
    if current_word == word:  
        current_count += count  
    else:  
        if current_word:  
            # write result to STDOUT  
            print('%s\t%s' % (current_word, current_count))
        current_count = count  
        current_word = word  
  
# do not forget to output the last word if needed!
# (checking for None avoids printing a bogus line on empty input)
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
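The same reduce logic can also be written with itertools.groupby, which groups consecutive records that share a key. The sketch below is an alternative, not the script above: the helper names parse and reduce_counts are made up, and the sample lines stand in for sorted mapper output. Like the original, it relies on the input being sorted by word.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs from tab-separated lines, skipping malformed ones."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # no tab, so there is no count to read
        try:
            value = int(count)
        except ValueError:
            continue  # count was not a number
        yield word, value


def reduce_counts(lines):
    """Sum the counts of consecutive lines that share the same word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


# Hypothetical sorted mapper output, including one malformed line.
sample = ["bar\t1\n", "foo\t1\n", "foo\t1\n", "oops\n", "quux\t1\n"]
for word, total in reduce_counts(sample):
    print("%s\t%d" % (word, total))
```

In a real reducer script, reduce_counts(sys.stdin) would replace the sample list.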

Test results:

hadoop@derekUbun:/usr/local/hadoop$ echo "foo foo quux labs foo bar quux" | ./mapper.py  
foo      1  
foo      1  
quux     1  
labs     1  
foo      1  
bar      1  
quux     1  
hadoop@derekUbun:/usr/local/hadoop$ echo "foo foo quux labs foo bar quux" |./mapper.py | sort |./reducer.py  
bar     1  
foo     3  
labs    1  
quux    2

A Worked Example

Requirement: this is only a small exercise, nothing deep and about as simple as it gets. It is just a small example to get things started.

Write a MapReduce streaming program (any language will do; here we use Python) that converts the data to "key=value" format, where the keys are "ip", "time", and "path".

For example, the line 175.44.30.93 - - [29/Sep/2013:00:10:16 +0800] "GET /structure/heap/ HTTP/1.1" 200 22539 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1;)"

is converted to: ip=175.44.30.93|time=29/Sep/2013:00:10:16|path=/structure/heap/ where the key/value pairs are separated by "|".

Steps:

1. Upload the log file to HDFS: hadoop fs -put <file> <destination>

2. Write the program. This is fairly simple. I think a mapper alone is enough, so I wrote only a mapper.

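#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

for line in sys.stdin:  # read records from standard input
    fields = line.strip().split()
    if len(fields) < 7:
        # skip blank or malformed lines that lack the expected fields
        continue
    # fields[0] is the client IP, fields[3] is "[time", fields[6] is the request path
    print('ip=%s|time=%s|path=%s' % (fields[0], fields[3].strip('[]'), fields[6]))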
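Splitting on whitespace works for the sample line, but a regular expression makes the expected layout explicit and rejects lines that do not fit it. The following is a sketch under the assumption that every record looks like the sample line shown earlier. LOG_PATTERN and to_key_value are names invented for this illustration.

```python
import re

# client IP, two ignored fields, [time zone], then "METHOD path ..."
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\s\]]+)[^\]]*\] "\S+ (\S+)')


def to_key_value(line):
    """Return 'ip=...|time=...|path=...' for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    ip, time, path = match.groups()
    return "ip=%s|time=%s|path=%s" % (ip, time, path)


sample = ('175.44.30.93 - - [29/Sep/2013:00:10:16 +0800] '
          '"GET /structure/heap/ HTTP/1.1" 200 22539 "-" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1;)"')
print(to_key_value(sample))
```

A mapper built on this would call to_key_value for each line of sys.stdin and print only the results that are not None.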

3. Test the program by running this command:

hadoop jar /home/biedong/hadoop-2.7.0/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar -mapper /home/biedong/test/mapper1.py -input /home/zuoye/access.log -output /home/zuoye/book-output

The job fails with an error saying the program cannot be found, for example: Caused by: java.io.IOException: Cannot run program "/user/hadoop/Mapper": error=2, No such file or directory

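Solution: specify these files with the -file option when submitting the job. In the error above, for instance, you could use "-file Mapper -file Reducer" or "-file Mapper.py -file Reducer.py", and Hadoop will then distribute the two files to every node automatically.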

hadoop jar /home/biedong/hadoop-2.7.0/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar -mapper /home/biedong/test/mapper1.py -file /home/biedong/test/mapper1.py -input /home/zuoye/access.log -output /home/zuoye/book-output

After the job finishes, the result on HDFS looks right: the output file is written normally and contains the expected 1904 records.

4. Let's add a reducer. This is also simple: the mapper has already done the processing, so the reducer just receives the mapper's output and prints it as is.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

for line in sys.stdin:
    print(line)

The problem: extra blank lines appear in the output.

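Finding the cause: by default, Streaming uses \t to separate the key and the value in a record. When a record has no \t, the whole record is treated as the key and the value is empty text. Because of this, a \t is appended to each line of mapper output, and the reducer receives each record as a key (the mapper's data line) followed by \t and an empty value. The line read from sys.stdin also still ends with a newline, and print adds another one, so printing the line unchanged produces a blank line after each record. Stripping the trailing \t and newline before printing removes it.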
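A minimal sketch of a pass-through reducer step that avoids the blank lines is shown here. The helper name clean and the sample record are hypothetical. The sample imitates what the reducer reads: the mapper's line, the tab added for the empty value, and the newline.

```python
def clean(line):
    """Drop the trailing newline and the tab left behind by the empty value."""
    return line.rstrip("\n").rstrip("\t")


# Hypothetical record as the reducer would read it from standard input.
received = ["ip=175.44.30.93|time=29/Sep/2013:00:10:16|path=/structure/heap/\t\n"]
for line in received:
    print(clean(line))
```

In the actual reducer, the loop would iterate over sys.stdin instead of the sample list.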

Source: https://www.cnblogs.com/chushiyaoyue/p/5713177.html
