Chaining multiple mapreduce tasks in Hadoop streaming

I am in a scenario where I have two MapReduce jobs. I am more comfortable with Python, so I plan to write the MapReduce scripts in it and run them with Hadoop Streaming. Is there a convenient way to chain both jobs in the following form when Hadoop Streaming is used?

Map1 -> Reduce1 -> Map2 -> Reduce2

I've heard of many ways to accomplish this in Java, but I need something for Hadoop Streaming.

Here is a great blog post on how to use Cascading and Streaming. http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/

The value here is that you can mix Java (Cascading query flows) with your custom streaming operations in the same app. I find this much less brittle than other methods.

Note that the Cascade object in Cascading allows you to chain multiple Flows (following the blog post above, your Streaming job would become a MapReduceFlow).

Disclaimer: I'm the author of Cascading

You can try Yelp's MRJob to get this done. It's an open-source MapReduce library that lets you write chained jobs that run atop Hadoop Streaming, either on your own Hadoop cluster or on EC2. It's pretty elegant and easy to use, and it has a method called steps that you can override to specify the exact chain of mappers and reducers that you want your data to go through.

Check out the source at https://github.com/Yelp/mrjob
and the documentation at http://packages.python.org/mrjob/

I typically do this with Hadoop Streaming and Python from within the bash script that I create to run the jobs in the first place. I always run from a bash script: that way I can get emails on errors and on success, and I can make the jobs more flexible by passing in parameters from a wrapping Ruby or Python script that works within a larger event processing system.

So the output of the first command (job) is the input to the next command (job). These paths can be variables in your bash script, passed in as arguments from the command line (simple and quick).
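Since the question asks for Python, the same driver idea can be sketched in Python instead of bash. This is only a minimal sketch under stated assumptions, not a definitive implementation: the jar path in `STREAMING_JAR`, the script names (`map1.py`, `reduce1.py`, `map2.py`, `reduce2.py`), the `stageN` output directories, and the helper names `streaming_command` and `run_chain` are all hypothetical. It also assumes the scripts are executable and start with a shebang line, and it uses the `-files` generic option found in current Hadoop releases (older releases used `-file`).

```python
import subprocess

# Assumed jar location; it differs between Hadoop versions and installs.
STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"


def streaming_command(mapper, reducer, input_path, output_path, jar=STREAMING_JAR):
    """Build the argument list for a single Hadoop Streaming job."""
    return [
        "hadoop", "jar", jar,
        # Generic options such as -files must come before streaming options.
        "-files", f"{mapper},{reducer}",
        "-mapper", mapper,
        "-reducer", reducer,
        "-input", input_path,
        "-output", output_path,
    ]


def run_chain(stages, input_path, work_dir, run=subprocess.run):
    """Run (mapper, reducer) stages in order.

    Each job writes to its own directory under work_dir, and that
    directory becomes the input of the next job. Returns the path
    of the final output directory.
    """
    current_input = input_path
    for i, (mapper, reducer) in enumerate(stages, start=1):
        output_path = f"{work_dir}/stage{i}"
        cmd = streaming_command(mapper, reducer, current_input, output_path)
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a failed job stops the chain before the next stage starts.
        run(cmd, check=True)
        current_input = output_path
    return current_input
```

Calling `run_chain([("map1.py", "reduce1.py"), ("map2.py", "reduce2.py")], "/data/in", "/tmp/chain")` would run Map1 -> Reduce1 into `/tmp/chain/stage1` and then feed that directory to Map2 -> Reduce2. The `run` parameter exists only so the command construction can be checked without a cluster; error emails or other notifications could be added by catching `subprocess.CalledProcessError` around the call.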

You might also want to check out Oozie (http://yahoo.github.com/oozie/design.html), a workflow engine for Hadoop that helps with this as well (it supports streaming, so that is not a problem). I did not have this when I started, so I ended up building my own thing, but it is a cool and useful system.

If you are already writing your mapper and reducer in Python, I would consider using Dumbo, where such an operation is straightforward. The sequence of your MapReduce jobs, your mappers, reducers, etc. all live in one Python script that can be run from the command line.

Source: https://stackoverflow.com/questions/4626356/chaining-multiple-mapreduce-tasks-in-hadoop-streaming

Tags

python

Hadoop

MapReduce

hadoop-plugins