问题
I am using Python and have to work on following scenario using Hadoop Streaming: a) Map1->Reduce1->Map2->Reduce2 b) I dont want to store intermediate files c) I dont want to install packages like Cascading, Yelp, Oozie. I have kept them as last option.
I already went through the same kind of discussion on SO and elsewhere but could not find an answer wrt Python. Can you please suggest.
回答1:
b) I dont want to store intermediate files
c) I dont want to install packages like Cascading, Yelp, Oozie.
Any reason why? Based on the response, a better solution could be provided.
Intermediates files cannot be avoided, because the o/p of the previous Hadoop job cannot be streamed as i/p to the next job. Create a script like this
run streaming job1
if job1 is not success then exit
run streaming job2
if job2 is success them remove o/p of job1 else exit
run streaming job3
if job3 is succcess them remove o/p of job2 else exit
回答2:
Why not using MapReduce frameworks for python streaming, like Dumbo https://github.com/klbostee/dumbo/wiki/Short-tutorial, or MRJob http://packages.python.org/mrjob/
For example, with dumbo, your pipe would be:
job.add_iter(Mapper1, Reducer1)
job.add_iter(Mapper2, Reducer2)
来源:https://stackoverflow.com/questions/8860214/can-we-cascade-multiple-mapreduce-jobs-in-hadoop-streaming-lang-python