How to add third-party Java JARs for use in PySpark

2020-11-29 03:08

I have some third-party database client libraries in Java. I want to access them through

java_gateway.py

E.g.: to make the client class (not

9 Answers
  • 2020-11-29 03:21

    You can add --jars xxx.jar when using spark-submit:

    ./bin/spark-submit --jars xxx.jar your_spark_script.py
    

    or set the environment variable SPARK_CLASSPATH:

    SPARK_CLASSPATH='/path/xxx.jar:/path/xx2.jar' your_spark_script.py
    

    Here, your_spark_script.py is a script written with the PySpark API.
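
    Once the JAR is on the classpath, the client class can be reached from that script through the Py4J gateway that PySpark exposes. A minimal sketch, where com.example.DbClient is a hypothetical class name assumed to live inside xxx.jar:

    from pyspark import SparkContext

    sc = SparkContext(appName="jar_example")

    # sc._jvm is the Py4J gateway's view of the JVM; com.example.DbClient
    # is a hypothetical stand-in for your third-party client class
    DbClient = sc._jvm.com.example.DbClient
    client = DbClient()   # invokes the no-argument Java constructor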

  • 2020-11-29 03:23

    One more thing you can do is to add the JAR to the jars folder of the PySpark installation, usually /python3.6/site-packages/pyspark/jars.

    If you are using a virtual environment, be careful: the JAR needs to go into the PySpark installation inside that virtual environment.

    This way you can use the JAR without passing it on the command line or loading it in your code.
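
    If you are unsure where that folder is, it can be located from the installed package itself; a minimal sketch, assuming pyspark is importable in the current environment:

    import os
    import pyspark

    # the bundled JARs live next to the installed pyspark package
    jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
    print(jars_dir)   # copy your third-party JAR into this directory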

  • 2020-11-29 03:26

    You can add the path to the JAR file through the Spark configuration at runtime.

    Here is an example:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().set("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
    sc = SparkContext(conf=conf)
    

    Refer to the documentation for more information.
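
    If you build a SparkSession rather than a bare SparkContext, the same property can be passed through the builder; a minimal sketch (the JAR path is just a placeholder):

    from pyspark.sql import SparkSession

    # spark.jars accepts a comma-separated list of JAR paths
    spark = (SparkSession.builder
             .appName("jar_example")
             .config("spark.jars", "/path-to-jar/your-library.jar")
             .getOrCreate())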

  • 2020-11-29 03:29

    None of the above answers worked for me.

    What I had to do with pyspark was:

    pyspark --py-files /path/to/jar/xxxx.jar
    

    For Jupyter Notebook:

    spark = (SparkSession
        .builder
        .appName("Spark_Test")
        .master('yarn-client')
        .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
        .config("spark.executor.cores", "4")
        .config("spark.executor.instances", "2")
        .config("spark.sql.shuffle.partitions","8")
        .enableHiveSupport()
        .getOrCreate())
    
    # Do this 
    
    spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")
    

    Link to the source where I found it: https://github.com/graphframes/graphframes/issues/104
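
    As an illustration of what this enables (an assumption based on the linked graphframes issue, whose JAR also bundles the graphframes Python package), the import becomes resolvable after addPyFile:

    # assumes the JAR added above via addPyFile is the graphframes JAR,
    # which ships the graphframes Python package inside it
    from graphframes import GraphFrame

    v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
    e = spark.createDataFrame([("a", "b", "friend")], ["src", "dst", "relationship"])
    g = GraphFrame(v, e)   # the JVM side still needs the JAR on its classpath
    g.edges.show()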

  • 2020-11-29 03:36

    For Java/Scala libraries accessed from PySpark, neither --jars nor spark.jars worked for me in version 2.4.0 and earlier (I didn't check newer versions). I'm surprised how many people claim that it works.

    The main problem is that for a classloader retrieved in the following way:

    from pyspark.sql import SparkSession

    jvm = SparkSession.builder.getOrCreate()._jvm
    clazz = jvm.my.scala.MyClass          # placeholder for your fully qualified class
    # or
    clazz = jvm.java.lang.Class.forName('my.scala.MyClass')
    

    it works only when you copy the JAR files to ${SPARK_HOME}/jars (this one works for me).

    But when your only option is --jars or spark.jars, a different classloader is used (a child classloader), which is set on the current thread. So your Python code needs to look like:

    clazz = jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(f"{object_name}$")
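
    For completeness, a minimal sketch putting this together; /path/to/my-lib.jar and my.scala.MyObject are placeholders, and the MODULE$ field lookup is the usual way to reach a Scala object's singleton instance:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.jars", "/path/to/my-lib.jar")   # placeholder path
             .getOrCreate())
    jvm = spark._jvm

    # JARs passed via --jars / spark.jars land on the thread's context
    # classloader, not on the one behind jvm.<package>.<Class> lookups
    loader = jvm.java.lang.Thread.currentThread().getContextClassLoader()
    clazz = loader.loadClass("my.scala.MyObject$")           # class of the Scala object
    instance = clazz.getDeclaredField("MODULE$").get(None)   # its singleton instance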
    

    Hope this explains your troubles. Give me a shout if not.

  • 2020-11-29 03:38

    You can add external JARs as arguments to pyspark:

    pyspark --jars file1.jar,file2.jar
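
    If the classes also need to be visible to the driver JVM at startup (a common case for JDBC drivers), you may additionally have to set the driver classpath; a hedged variant of the same command:

    pyspark --jars file1.jar,file2.jar --driver-class-path file1.jar:file2.jar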
    