Piping data into jobs in Hadoop MR/Pig

Submitted by 我是研究僧i on 2020-01-07 04:31:25

Question


I have three different types of jobs running on data in HDFS. Currently these three jobs have to be run separately. We now want to run them together by piping the output of one job into the next without writing intermediate data to HDFS, to improve the architecture and overall performance.

Any suggestions are welcome for this scenario.

PS: Oozie does not fit our workflow. The Cascading framework is also ruled out because of scalability issues. Thanks.


Answer 1:


Hadoop inherently writes to storage (e.g. HDFS) after each MapReduce step. If you want to keep intermediate data in memory, you may need to look into something like Spark.
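The core idea of keeping intermediate results in memory instead of materializing them between stages can be sketched in plain Python with chained generators. This is a conceptual illustration of the pipelining pattern, not Spark API code; the three "jobs" and the sample data are invented for the example:

```python
# Three "jobs" chained as generators: records flow from one stage
# to the next in memory, with no intermediate files written.

def job1(lines):
    # Stage 1: parse raw CSV lines into (key, value) pairs.
    for line in lines:
        key, value = line.split(",")
        yield key, int(value)

def job2(pairs):
    # Stage 2: filter out records whose value is below a threshold.
    for key, value in pairs:
        if value >= 10:
            yield key, value

def job3(pairs):
    # Stage 3: aggregate values per key into a dict.
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

raw = ["a,5", "b,12", "a,20", "b,3"]
result = job3(job2(job1(raw)))
print(result)  # {'b': 12, 'a': 20}
```

Because the stages are lazy generators, each record is pulled through all three stages on demand; this is roughly the behavior Spark gives you at cluster scale with chained transformations on an RDD or DataFrame.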




Answer 2:


Oozie helps to chain multiple Hadoop jobs (MapReduce, Pig, Hive, Java, etc.) together to form a data-pipeline application. Its built-in support for scheduling and Hadoop-related functions makes it much easier for developers to manage complex Hadoop jobs.
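As a sketch of what such chaining looks like, a hypothetical `workflow.xml` might wire three actions together in sequence. All names, paths, and scripts below are illustrative placeholders, not taken from the original question:

```xml
<!-- Illustrative Oozie workflow chaining three jobs in sequence. -->
<workflow-app name="three-job-pipeline" xmlns="uri:oozie:workflow:0.5">
    <start to="job-one"/>

    <action name="job-one">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- mapper/reducer configuration goes here -->
        </map-reduce>
        <ok to="job-two"/>
        <error to="fail"/>
    </action>

    <action name="job-two">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>stage-two.pig</script>
        </pig>
        <ok to="job-three"/>
        <error to="fail"/>
    </action>

    <action name="job-three">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Pipeline failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Note that even with this orchestration, each action still reads from and writes to HDFS; Oozie sequences the jobs but does not move data between them in memory.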

However, Oozie doesn't eliminate intermediate data storage in HDFS or other forms such as the local file system or a database. To achieve that, you would need to introduce an in-memory data store, a message-queue system, or some other system that works at the scale of data you have.




Answer 3:


You may try using Hue. See: http://blog.cloudera.com/blog/2014/10/new-in-cdh-5-2-new-security-app-and-more-in-hue/

CDH 5.2 includes important new usability functionality via Hue, the open source GUI that makes Apache Hadoop easy to use. In addition to shipping a brand-new app for managing security permissions, this release is particularly feature-packed, and is becoming a great complement to BI tools from Cloudera partners like Tableau, MicroStrategy, and Zoomdata because a more usable Hadoop translates into better BI overall across your organization!



Source: https://stackoverflow.com/questions/27506306/piping-data-into-jobs-in-hadoop-mr-pig
