Question
I have written a mapper and a reducer in Python for a word count program, and they work fine. Here is a sample run:
echo "hello hello world here hello here world here hello" | wordmapper.py | sort -k1,1 | wordreducer.py
hello 4
here 3
world 2
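The question does not show wordmapper.py or wordreducer.py, so the following is only a minimal sketch of what such a streaming-style pair might look like. The function names map_words and reduce_counts are hypothetical, and the tab-separated "word, count" intermediate format is an assumption. The sorted() call stands in for the `sort -k1,1` step of the pipeline above.

```python
from itertools import groupby


def map_words(lines):
    """Yield one 'word<TAB>1' record per whitespace-separated word (hypothetical mapper)."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_counts(sorted_records):
    """Sum the counts of consecutive identical keys (hypothetical reducer).

    The input must already be sorted by key, as it is after `sort -k1,1`.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word} {sum(int(count) for _, count in group)}"


# Reproduce the local pipeline from the question on its sample sentence.
sample = "hello hello world here hello here world here hello"
records = sorted(map_words([sample]))  # stands in for `sort -k1,1`
result = list(reduce_counts(records))
print("\n".join(result))
```

In a real streaming script each function would read from sys.stdin and print to stdout; the logic is kept in plain functions here so it can be checked without a pipe.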
Now, when I try to submit a Hadoop job for a large file, I get errors:
hadoop jar share/hadoop/tools/sources/hadoop-*streaming*.jar -file wordmapper.py -mapper wordmapper.py -file wordreducer.py -reducer wordreducer.py -input /data/1jrl.pdb -output /output/py_jrl
Exception in thread "main" java.lang.ClassNotFoundException: share.hadoop.tools.sources.hadoop-streaming-2.2.0-test-sources.jar
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:249)
at org.apache.hadoop.util.RunJar.main(RunJar.java:205)
I then changed the command line to the following (removing the wildcard from above):
hadoop jar share/hadoop/tools/sources/hadoop-streaming-2.2.0-sources.jar -file wordmapper.py -mapper wordmapper.py -file wordreducer.py -reducer wordreducer.py -input /data/1jrl.pdb -output /output/py_jrl
Exception in thread "main" java.lang.ClassNotFoundException: -file
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:249)
at org.apache.hadoop.util.RunJar.main(RunJar.java:205)
Why do I get these errors, and how can I fix them?
I am using Hadoop 2.
Thanks!
Answer 1:
Well, at least one of your issues is that you are using the -sources.jar, which contains just .java files and can't be executed.
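To see why the two error messages name such odd "classes", here is a simplified model of how `hadoop jar` appears to pick the class to run: if the jar's manifest names no Main-Class, the next argument is treated as the class name, with slashes turned into dots. This is a sketch of that behavior as I understand it, not Hadoop's actual code. resolve_main_class is a hypothetical helper, and the assumption that the wildcard matched exactly two jars is inferred from the first stack trace.

```python
def resolve_main_class(args, manifest_main_class=None):
    """Simplified model of `hadoop jar <jar> [mainClass] args...` (hypothetical helper).

    args: the arguments after `hadoop jar`, already expanded by the shell.
    manifest_main_class: the jar manifest's Main-Class, or None if it has none.
    Returns (jar, main_class, remaining_args).
    """
    jar, rest = args[0], list(args[1:])
    if manifest_main_class is not None:
        return jar, manifest_main_class, rest
    # No Main-Class in the manifest: the next argument is taken as the class name.
    return jar, rest[0].replace("/", "."), rest[1:]


SRC = "share/hadoop/tools/sources/hadoop-streaming-2.2.0-sources.jar"
TEST_SRC = "share/hadoop/tools/sources/hadoop-streaming-2.2.0-test-sources.jar"
job_args = ["-file", "wordmapper.py", "-mapper", "wordmapper.py"]

# First attempt: the shell expands the wildcard to two jars (assumed from the trace),
# so the second jar path lands in the "class name" position.
_, wildcard_class, _ = resolve_main_class([SRC, TEST_SRC] + job_args)

# Second attempt: a single sources jar, so "-file" lands in that position.
_, explicit_class, _ = resolve_main_class([SRC] + job_args)

print(wildcard_class)
print(explicit_class)
```

Under these assumptions the two printed names match the ones in the ClassNotFoundException messages from the question.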
Try using this instead:
share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar
If that doesn't exist, look for a hadoop-streaming*.jar that doesn't have -sources in the file name.
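The search for a usable jar can be sketched as below. find_streaming_jars is a hypothetical helper, and passing the Hadoop install directory as an argument is an assumption about your layout; the filter simply applies the rule above by skipping any file with "sources" in its name.

```python
from pathlib import Path


def find_streaming_jars(hadoop_home):
    """Return hadoop-streaming jars under hadoop_home, skipping source-only jars.

    hadoop_home is assumed to be the root of the Hadoop installation.
    """
    root = Path(hadoop_home)
    return sorted(
        path
        for path in root.rglob("hadoop-streaming*.jar")
        if "sources" not in path.name
    )
```

For example, calling find_streaming_jars(".") from the Hadoop install directory lists the candidates, and the chosen path can then be passed to hadoop jar in place of the -sources.jar.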
Source: https://stackoverflow.com/questions/24661653/unable-to-run-map-reduce-using-python-in-hadoop