How to add third-party Java jars for use in pyspark

Posted by 前提是你 on 2019-11-27 19:46:41

You can pass external jars as arguments to pyspark, as a comma-separated list:

pyspark --jars file1.jar,file2.jar

You can also set the path to the jar file at runtime through the Spark configuration.

Here is an example:

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")

sc = SparkContext(conf=conf)
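When more than one jar is needed, spark.jars takes the same comma-separated list as --jars. Here is a minimal sketch of building that value; the helper names spark_jars_value and make_conf are hypothetical and not part of the pyspark API, and the jar paths are placeholders:

```python
import os


def spark_jars_value(jar_paths):
    """Join jar paths into the comma-separated string used by --jars and spark.jars.

    Hypothetical helper for illustration; not part of pyspark.
    """
    cleaned = []
    for path in jar_paths:
        path = os.fspath(path).strip()
        if not path.endswith(".jar"):
            raise ValueError(f"not a jar file: {path!r}")
        if "," in path:
            # A comma inside a path would be read as a list separator.
            raise ValueError(f"jar path must not contain a comma: {path!r}")
        cleaned.append(path)
    return ",".join(cleaned)


def make_conf(jar_paths):
    """Build a SparkConf with spark.jars set (requires pyspark to be installed)."""
    # Imported lazily so spark_jars_value works even without pyspark.
    from pyspark import SparkConf

    return SparkConf().set("spark.jars", spark_jars_value(jar_paths))
```

The string returned by spark_jars_value(["/path/file1.jar", "/path/file2.jar"]) can be passed after --jars, and the object returned by make_conf(...) can be handed to SparkContext(conf=...).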

Refer to the documentation for more information.

Ryan Chou

You can add --jars xxx.jar when using spark-submit:

./bin/spark-submit --jars xxx.jar

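or set the environment variable SPARK_CLASSPATH when running a script written with the pyspark API (Spark has flagged this variable as deprecated since 1.0 in favor of --driver-class-path and spark.executor.extraClassPath):

SPARK_CLASSPATH='/path/xxx.jar:/path/xx2.jar'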

  1. Extract the downloaded jar file.
  2. Edit the system environment variables:
    • Add a variable named SPARK_CLASSPATH and set its value to \path\to\the\extracted\jar\file.

E.g., if you extracted the jar file to a folder named sparkts on the C drive, its value should be C:\sparkts

  3. Restart your cluster.
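As an alternative to editing the system-wide variable, SPARK_CLASSPATH could be set for a single launch only. The sketch below assumes the value follows the usual Java classpath convention, with entries separated by ':' on Linux and macOS and ';' on Windows. The helper name env_with_spark_classpath and the script name in the usage comment are hypothetical:

```python
import os


def env_with_spark_classpath(jar_paths, base_env=None):
    """Return a copy of the environment with SPARK_CLASSPATH set.

    Hypothetical helper; entries are joined with os.pathsep
    (':' on Linux/macOS, ';' on Windows), like a regular Java classpath.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["SPARK_CLASSPATH"] = os.pathsep.join(os.fspath(p) for p in jar_paths)
    return env


# Usage sketch (the script name is a placeholder):
# import subprocess
# subprocess.run(["spark-submit", "your_script.py"],
#                env=env_with_spark_classpath(["/path/xxx.jar", "/path/xx2.jar"]))
```

Because the helper returns a copy, the environment of the calling process is left untouched.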

None of the above answers worked for me.

What I had to do with pyspark was:

pyspark --py-files /path/to/jar/xxxx.jar

For Jupyter Notebook:

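from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .config("spark.executor.cores", "4")
    .config("spark.executor.instances", "2")
    .getOrCreate())

# Do this (the notebook counterpart of --py-files above)
spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")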


One more thing you can do is copy the jar into the jars folder of the pyspark installation, usually /python3.6/site-packages/pyspark/jars

If you are using a virtual environment, make sure the jar goes into the pyspark installation inside that virtual environment.

This way you can use the jar without passing it on the command line or loading it in your code.
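The folder can also be located programmatically instead of guessing the site-packages path. This is a rough sketch, assuming pyspark was installed with pip and SPARK_HOME does not point to a separate Spark distribution. The function names pyspark_jars_dir and install_jar are hypothetical:

```python
import importlib.util
import shutil
from pathlib import Path


def pyspark_jars_dir():
    """Return the jars folder of the pyspark visible to this interpreter, or None.

    Uses the import system, so inside a virtual environment it resolves to
    that environment's pyspark installation.
    """
    spec = importlib.util.find_spec("pyspark")
    if spec is None or spec.origin is None:
        return None  # pyspark is not installed for this interpreter
    return Path(spec.origin).parent / "jars"


def install_jar(jar_path):
    """Copy a jar into the pyspark jars folder and return the new path."""
    target_dir = pyspark_jars_dir()
    if target_dir is None or not target_dir.is_dir():
        raise RuntimeError("could not locate the pyspark jars folder")
    return Path(shutil.copy(jar_path, target_dir))
```

Running it with the interpreter of the active virtual environment keeps the jar out of any other pyspark installation on the machine.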
