Managing dependencies with Hadoop Streaming?

Question


I have a quick Hadoop Streaming question. If I'm using Python streaming and my mappers/reducers require Python packages that aren't installed by default, do I need to install those on all the Hadoop machines as well, or is there some sort of serialization that sends them to the remote machines?


Answer 1:


If they're not installed on your task boxes, you can send them with -file. If you need a package or other directory structure, you can send a zipfile, which will be unpacked for you. Here's a Hadoop 0.17 invocation:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.17.0-streaming.jar -mapper mapper.py -reducer reducer.py -input input/foo -output output -file /tmp/foo.py -file /tmp/lib.zip
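
For illustration, here is a minimal sketch of what mapper.py might do to import code shipped this way; the module name lib and its process() helper are hypothetical, not part of the original answer. Python can import modules directly from a zip archive once the archive is on sys.path, so this works whether or not the framework has unpacked it:

#!/usr/bin/env python
import os
import sys

# Files shipped with -file land in the task's working directory;
# putting the archive itself on sys.path lets Python's zipimport
# machinery load modules straight out of it.
sys.path.insert(0, os.path.join(os.getcwd(), 'lib.zip'))

import lib  # hypothetical module packaged inside lib.zip

for line in sys.stdin:
    key, value = lib.process(line)  # hypothetical helper
    print('%s\t%s' % (key, value))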

However, see this issue for a caveat:

https://issues.apache.org/jira/browse/MAPREDUCE-596

Answer 2:


If you use Dumbo, you can use -libegg to distribute egg files and auto-configure the Python runtime:

https://github.com/klbostee/dumbo/wiki/Short-tutorial#wiki-eggs_and_jars
https://github.com/klbostee/dumbo/wiki/Configuration-files
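
As a sketch of what that looks like in practice (the job script myjob.py and the egg name are hypothetical), a Dumbo invocation that ships an egg might be:

dumbo start myjob.py -hadoop $HADOOP_HOME -input input/foo -output output -libegg /tmp/mylib-0.1.egg

Per the tutorial linked above, Dumbo then distributes the egg and configures the Python runtime on the task nodes so the job script can import it.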

Source: https://stackoverflow.com/questions/2862345/managing-dependencies-with-hadoop-streaming
