Can we cascade multiple MapReduce jobs in Hadoop Streaming (lang: Python)?

Question

I am using Python and have to handle the following scenario with Hadoop Streaming: a) Map1 -> Reduce1 -> Map2 -> Reduce2; b) I don't want to store intermediate files; c) I don't want to install packages like Cascading, Yelp, or Oozie. I have kept them as a last option.

I have already gone through similar discussions on SO and elsewhere but could not find an answer for Python. Can you please suggest an approach?

Answer 1:

b) I don't want to store intermediate files

c) I don't want to install packages like Cascading, Yelp, Oozie.

Any reason why? Based on the response, a better solution could be provided.

Intermediate files cannot be avoided, because the output of one Hadoop job cannot be streamed as input to the next job. Create a script like this:

run streaming job1
if job1 did not succeed then exit
run streaming job2
if job2 succeeded then remove the output of job1 else exit
run streaming job3
if job3 succeeded then remove the output of job2 else exit
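The script outlined above could be sketched in Python roughly as follows. This is only a minimal sketch under several assumptions: the jar path in STREAMING_JAR, the temporary directory prefix, and the helper names streaming_cmd and run_chain are all hypothetical; the mapper and reducer scripts are assumed to be executable; and the "hadoop fs -rm -r" form assumes a reasonably recent Hadoop (older releases used "-rmr").

```python
import subprocess

# Hypothetical jar location; adjust it for your installation.
STREAMING_JAR = "/path/to/hadoop-streaming.jar"


def streaming_cmd(mapper, reducer, input_path, output_path):
    """Build the argument list for one Hadoop Streaming job."""
    return [
        "hadoop", "jar", STREAMING_JAR,
        "-files", f"{mapper},{reducer}",  # ship both scripts to the cluster
        "-mapper", mapper,
        "-reducer", reducer,
        "-input", input_path,
        "-output", output_path,
    ]


def run_chain(stages, input_path, final_output,
              tmp_prefix="tmp_stage", run=subprocess.call):
    """Run (mapper, reducer) stages in order, feeding each output to the next.

    `run` takes an argument list and returns an exit code; it is a
    parameter so the chain can be dry-run without a cluster.
    Returns True only if every job succeeded.
    """
    current_input = input_path
    previous_tmp = None
    for i, (mapper, reducer) in enumerate(stages, start=1):
        last = i == len(stages)
        output = final_output if last else f"{tmp_prefix}_{i}"
        if run(streaming_cmd(mapper, reducer, current_input, output)) != 0:
            # Stop at the first failure and keep the files for debugging.
            return False
        if previous_tmp is not None:
            # The job that consumed this intermediate output has succeeded.
            run(["hadoop", "fs", "-rm", "-r", previous_tmp])
        previous_tmp = None if last else output
        current_input = output
    return True
```

For the scenario in the question, a call such as run_chain([("map1.py", "reduce1.py"), ("map2.py", "reduce2.py")], "input", "output") would run both jobs and delete the intermediate directory once the second job has finished successfully. The intermediate files still exist for a while, but only for as long as the next job needs them.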

Answer 2:

Why not use a MapReduce framework for Python streaming, such as Dumbo (https://github.com/klbostee/dumbo/wiki/Short-tutorial) or MRJob (http://packages.python.org/mrjob/)?

For example, with Dumbo, your pipeline would be:

job.additer(Mapper1, Reducer1)
job.additer(Mapper2, Reducer2)

Source: https://stackoverflow.com/questions/8860214/can-we-cascade-multiple-mapreduce-jobs-in-hadoop-streaming-lang-python

Tags

python

Hadoop

MapReduce

hadoop-streaming
