How to add third-party Java jars for use in pyspark


Question


I have some third-party database client libraries in Java. I want to access them through

java_gateway.py

E.g., to make the client class (not a JDBC driver!) available to the Python client via the Java gateway:

java_import(gateway.jvm, "org.mydatabase.MyDBClient")

It is not clear where to add the third-party libraries to the JVM classpath. I tried adding them to compute-classpath.sh, but that did not seem to work: I get

 Py4JError: Trying to call a package

Also, comparing with Hive: the Hive jar files are NOT loaded via compute-classpath.sh, which makes me suspicious. There seems to be some other mechanism for setting up the JVM-side classpath.
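(For what it's worth, that Py4JError usually means Py4J resolved the name as a package because the class was not found on the JVM classpath. A minimal check, assuming a pyspark shell where sc is defined and using the hypothetical org.mydatabase.MyDBClient class from above:)

from py4j.java_gateway import java_import
from py4j.protocol import Py4JError

# import the (hypothetical) client class into the gateway's JVM view
java_import(sc._jvm, "org.mydatabase.MyDBClient")

try:
    client = sc._jvm.org.mydatabase.MyDBClient()
except Py4JError:
    # Py4J raises "Trying to call a package" when the name resolves to
    # a package rather than a class, i.e. the jar is not on the classpath
    print("MyDBClient is not on the JVM classpath")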


Answer 1:


You can add external jars as arguments to pyspark:

pyspark --jars file1.jar,file2.jar
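Once the shell is started this way, the driver JVM can load classes from those jars. A minimal sketch, using the hypothetical class from the question:

# inside a pyspark shell started with --jars, access the class through
# the driver's JVM gateway (org.mydatabase.MyDBClient is hypothetical)
client = sc._jvm.org.mydatabase.MyDBClient()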



Answer 2:


You can add the path to a jar file using Spark configuration at runtime.

Here is an example:

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")

sc = SparkContext(conf=conf)

Refer to the Spark configuration documentation for more information.
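For Spark 2.x+, the same setting can also go through SparkSession.builder. A minimal sketch, reusing the jar path from the example above (the app name is a placeholder; note that spark.jars only takes effect if no SparkContext is running yet):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("jar-config-demo")
    .config("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
    .getOrCreate())

sc = spark.sparkContext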




Answer 3:


You can add --jars xxx.jar when using spark-submit:

./bin/spark-submit --jars xxx.jar your_spark_script.py

or set the environment variable SPARK_CLASSPATH:

SPARK_CLASSPATH='/path/xxx.jar:/path/xx2.jar' spark-submit your_spark_script.py

where your_spark_script.py is written using the PySpark API. (Note that SPARK_CLASSPATH is deprecated in recent Spark versions; prefer --jars or the spark.driver.extraClassPath / spark.executor.extraClassPath settings.)
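For completeness, a minimal sketch of what your_spark_script.py might contain (the app name is a placeholder):

# your_spark_script.py
from pyspark import SparkContext

sc = SparkContext(appName="classpath-demo")
# classes from xxx.jar are now loadable by the driver JVM, e.g. via sc._jvm
print(sc.version)
sc.stop()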




Answer 4:


  1. Extract the downloaded jar file.
  2. Edit the system environment variables:
    • Add a variable named SPARK_CLASSPATH and set its value to \path\to\the\extracted\jar\file.

      E.g., if you have extracted the jar file to a folder named sparkts on the C drive, its value should be: C:\sparkts

  3. Restart your cluster.



Answer 5:


None of the above answers worked for me.

What I had to do with pyspark was:

pyspark --py-files /path/to/jar/xxxx.jar

For Jupyter Notebook:

spark = (SparkSession
    .builder
    .appName("Spark_Test")
    .master('yarn-client')
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .config("spark.executor.cores", "4")
    .config("spark.executor.instances", "2")
    .config("spark.sql.shuffle.partitions","8")
    .enableHiveSupport()
    .getOrCreate())

# Do this 

spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")

Link to the source where I found it: https://github.com/graphframes/graphframes/issues/104
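(This works because a jar is a zip archive, so Python can import modules from it once it is on the Python path; for packages such as graphframes, the jar bundles the Python modules alongside the Java classes.)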




Answer 6:


One more thing you can do is add the jar to the jars folder of the pyspark installation, usually /python3.6/site-packages/pyspark/jars.

If you are using a virtual environment, be careful: the jar needs to go into the pyspark installation inside the virtual environment.

This way you can use the jar without passing it on the command line or loading it in your code.
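If you are not sure where that folder is, a small sketch to locate it programmatically (assuming a pip-installed pyspark):

import os
import pyspark

# the jars folder sits next to the pyspark package itself
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
print(jars_dir)  # copy your jar into this directory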



Source: https://stackoverflow.com/questions/27698111/how-to-add-third-party-java-jars-for-use-in-pyspark
