Piping data into jobs in Hadoop MR/Pig

Submitted by 我是研究僧i on 2020-01-07 04:31:25

Question


I have three different types of jobs running on data in HDFS. Currently these three jobs have to be run separately. We now want to run them together by piping the output of one job into the next without writing intermediate data to HDFS, to improve the architecture and overall performance.

Any suggestions are welcome for this scenario.

PS: Oozie does not fit our workflow. The Cascading framework is also ruled out because of scalability issues. Thanks.


Answer 1:


Hadoop inherently writes to storage (e.g. HDFS) after each MapReduce step. If you want to keep intermediate data in memory, you may need to look into something like Spark.
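The core idea of keeping intermediate results in memory instead of materializing them between stages can be sketched in plain Python with chained generators. This is a conceptual illustration of the pipelining pattern, not Spark API code; the three "jobs" and the sample data are invented for the example:

```python
# Three "jobs" chained as generators: records flow from one stage
# to the next in memory, with no intermediate files written.

def job1(lines):
    # Stage 1: parse raw CSV lines into (key, value) pairs.
    for line in lines:
        key, value = line.split(",")
        yield key, int(value)

def job2(pairs):
    # Stage 2: filter out records whose value is below a threshold.
    for key, value in pairs:
        if value >= 10:
            yield key, value

def job3(pairs):
    # Stage 3: aggregate values per key into a dict.
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

raw = ["a,5", "b,12", "a,20", "b,3"]
result = job3(job2(job1(raw)))
print(result)  # {'b': 12, 'a': 20}
```

Because the stages are lazy generators, each record is pulled through all three stages on demand; this is roughly the behavior Spark gives you at cluster scale with chained transformations on an RDD or DataFrame.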




Answer 2:


Oozie helps to chain multiple Hadoop jobs (MapReduce, Pig, Hive, Java, etc.) together to form a data-pipeline application. Its built-in support for scheduling and Hadoop-related functions makes it much easier for developers to manage complex Hadoop jobs.
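As a sketch of what such chaining looks like, a hypothetical `workflow.xml` might wire three actions together in sequence. All names, paths, and scripts below are illustrative placeholders, not taken from the original question:

```xml
<!-- Illustrative Oozie workflow chaining three jobs in sequence. -->
<workflow-app name="three-job-pipeline" xmlns="uri:oozie:workflow:0.5">
    <start to="job-one"/>

    <action name="job-one">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- mapper/reducer configuration goes here -->
        </map-reduce>
        <ok to="job-two"/>
        <error to="fail"/>
    </action>

    <action name="job-two">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>stage-two.pig</script>
        </pig>
        <ok to="job-three"/>
        <error to="fail"/>
    </action>

    <action name="job-three">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Pipeline failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Note that even with this orchestration, each action still reads from and writes to HDFS; Oozie sequences the jobs but does not move data between them in memory.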

However, Oozie doesn't eliminate intermediate data storage in HDFS or other forms such as the local file system or a database. To achieve that, you would need to introduce an in-memory data store, a message-queue system, or some other system that works at the scale of data you have.




Answer 3:


You may try using Hue. See: http://blog.cloudera.com/blog/2014/10/new-in-cdh-5-2-new-security-app-and-more-in-hue/

CDH 5.2 includes important new usability functionality via Hue, the open source GUI that makes Apache Hadoop easy to use. In addition to shipping a brand-new app for managing security permissions, this release is particularly feature-packed, and is becoming a great complement to BI tools from Cloudera partners like Tableau, MicroStrategy, and Zoomdata because a more usable Hadoop translates into better BI overall across your organization!



Source: https://stackoverflow.com/questions/27506306/piping-data-into-jobs-in-hadoop-mr-pig
