Question
I'm trying to run a custom HDFS reader class in PySpark. This class is written in Java and I need to access it from PySpark, either from the shell or with spark-submit.
In PySpark, I retrieve the JavaGateway from the SparkContext (sc._gateway).
Say I have a class:
package org.foo.module;

public class Foo {
    public int fooMethod() {
        return 1;
    }
}
I've tried packaging it into a jar, passing it to pyspark with the --jars option, and then running:
from py4j.java_gateway import java_import
jvm = sc._gateway.jvm
java_import(jvm, "org.foo.module.*")
foo = jvm.org.foo.module.Foo()
But I get the error:
Py4JError: Trying to call a package.
Can someone help with this? Thanks.
Answer 1:
In PySpark, try the following:
from py4j.java_gateway import java_import
java_import(sc._gateway.jvm,"org.foo.module.Foo")
func = sc._gateway.jvm.Foo()
func.fooMethod()
Make sure that your Java code is compiled and packaged into a jar, then submit the Spark job like so:
spark-submit --driver-class-path "name_of_your_jar_file.jar" --jars "name_of_your_jar_file.jar" name_of_your_python_file.py
Answer 2:
The problem you've described usually indicates that org.foo.module is not on the driver CLASSPATH. One possible solution is to use spark.driver.extraClassPath to add your jar file. It can be set, for example, in conf/spark-defaults.conf or provided as a command-line parameter.
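One way to confirm this diagnosis from the PySpark shell is to look at what Py4J returns for the name before calling it. The sketch below assumes that Py4J hands back a JavaPackage placeholder for any dotted name it cannot resolve to a class, which is what leads to the "Trying to call a package" error. The helper names are hypothetical and not part of Py4J or PySpark.

```python
def resolve_jvm_name(jvm, dotted_name):
    """Walk `jvm` one attribute at a time and return what Py4J resolves."""
    obj = jvm
    for part in dotted_name.split("."):
        obj = getattr(obj, part)
    return obj


def is_unresolved_package(obj):
    # Assumption: an unresolved name comes back as a py4j JavaPackage.
    # Comparing the type name keeps this sketch free of a py4j import;
    # isinstance(obj, py4j.java_gateway.JavaPackage) is the stricter check.
    return type(obj).__name__ == "JavaPackage"


def check_class_visible(jvm, dotted_name):
    """Return True if `dotted_name` resolved to something other than a package."""
    return not is_unresolved_package(resolve_jvm_name(jvm, dotted_name))
```

In the shell this could be called as check_class_visible(sc._gateway.jvm, "org.foo.module.Foo"). A False result suggests the jar is not visible to the driver JVM.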
On a side note:
- If the class you use is a custom input format, there should be no need to use the Py4J gateway at all. You can simply use the SparkContext.hadoop* / SparkContext.newAPIHadoop* methods.
- Using java_import(jvm, "org.foo.module.*") looks like a bad idea. Generally speaking, you should avoid unnecessary imports on the JVM. The JVM gateway is not public for a reason, and you really don't want to mess with it, especially when you access the class in a way that makes the import completely unnecessary. So drop java_import and stick with jvm.org.foo.module.Foo().
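For the input-format route mentioned in the side note, a minimal sketch might look like the following. It assumes the custom reader is exposed as a new-API Hadoop InputFormat. The class name org.foo.module.FooInputFormat is hypothetical, and the LongWritable/Text key and value classes are placeholders to be replaced with whatever the format actually emits.

```python
def read_with_custom_format(
    sc,
    path,
    input_format="org.foo.module.FooInputFormat",  # hypothetical class name
    key_class="org.apache.hadoop.io.LongWritable",  # assumed key type
    value_class="org.apache.hadoop.io.Text",        # assumed value type
):
    """Return an RDD of (key, value) pairs read through a Hadoop InputFormat.

    `sc` is an existing SparkContext; the jar containing the input format
    still has to be on the classpath when the job is launched.
    """
    return sc.newAPIHadoopFile(
        path,
        inputFormatClass=input_format,
        keyClass=key_class,
        valueClass=value_class,
    )
```

With this approach the Python side never touches sc._gateway. Spark instantiates the input format itself from the class name.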
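Answer 3:
Rather than --jars, you should use --packages to import packages into your spark-submit action. Note that --packages expects Maven coordinates (groupId:artifactId:version), so it applies to jars published to a repository rather than to local jar files.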
Source: https://stackoverflow.com/questions/33544105/running-custom-java-class-in-pyspark