Add Jar to standalone pyspark

前端未结


I'm launching a pyspark program:

$ export SPARK_HOME=
$ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip
$ python

5 Answers

渐次进展

2020-11-27 17:11
I encountered a similar issue with a different jar ("MongoDB Connector for Spark", mongo-spark-connector). The big caveat was that I had installed Spark via pyspark in conda (conda install pyspark), so the answers written for a standalone Spark installation weren't exactly helpful. For those of you installing with conda, here is the process that I cobbled together:

1) Find where your pyspark/jars are located. Mine were in this path: ~/anaconda2/pkgs/pyspark-2.3.0-py27_0/lib/python2.7/site-packages/pyspark/jars.

2) Download the jar file into the path found in step 1, from this location.

3) Now you should be able to run something like this (code taken from MongoDB official tutorial, using Briford Wylie's answer above):
```
from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1:27017/spark.test_pyspark_mbd_conn") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1:27017/spark.test_pyspark_mbd_conn") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.2') \
    .getOrCreate()
```
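Step 1 can also be done programmatically. This is a minimal sketch, assuming the conda/pip pyspark package keeps its bundled jars in a `jars` subdirectory of the package, as in the path shown above. The helper name `find_pyspark_jars_dir` is hypothetical and is not part of pyspark.

```python
import importlib.util
import os
from typing import Optional


def find_pyspark_jars_dir() -> Optional[str]:
    """Return the `jars` directory inside the installed pyspark package, or None."""
    # find_spec locates the package without importing it, so no JVM is started.
    spec = importlib.util.find_spec("pyspark")
    if spec is None or spec.origin is None:
        return None  # pyspark is not installed in this environment
    # spec.origin points at pyspark/__init__.py; the jars sit next to it (assumption).
    jars_dir = os.path.join(os.path.dirname(spec.origin), "jars")
    return jars_dir if os.path.isdir(jars_dir) else None


if __name__ == "__main__":
    print(find_pyspark_jars_dir())
```

If the function returns None, either pyspark is not importable from the current interpreter or this layout assumption does not hold for your install.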
Disclaimers:

1) I don't know whether this is the right SO question for this answer; please point me to a better place and I will move it.

2) If you think I have made a mistake or have improvements to the process above, please comment and I will revise.
悲&欢浪女

2020-11-27 17:12
Any dependencies can be passed using the spark.jars.packages property (setting spark.jars should work as well) in $SPARK_HOME/conf/spark-defaults.conf. It should be a comma-separated list of coordinates.

Packages and classpath properties have to be set before the JVM is started, which happens during SparkConf initialization. This means that the SparkConf.set method cannot be used here.

An alternative approach is to set the PYSPARK_SUBMIT_ARGS environment variable before the SparkConf object is initialized:
```
import os
from pyspark import SparkConf, SparkContext

SUBMIT_ARGS = "--packages com.databricks:spark-csv_2.11:1.2.0 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

conf = SparkConf()
sc = SparkContext(conf=conf)
```
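When several packages or local jars are needed, the argument string can be assembled rather than typed by hand. This is a minimal sketch, assuming the standard spark-submit flags `--packages` and `--jars`, both of which take comma-separated lists. The helper name `build_submit_args` is hypothetical.

```python
import os
from typing import Sequence


def build_submit_args(packages: Sequence[str] = (), jars: Sequence[str] = ()) -> str:
    """Build a PYSPARK_SUBMIT_ARGS value from Maven coordinates and local jar paths."""
    parts = []
    if packages:
        # Maven coordinates, comma separated: group:artifact:version
        parts += ["--packages", ",".join(packages)]
    if jars:
        # Paths to local jar files, comma separated
        parts += ["--jars", ",".join(jars)]
    # The value must end with "pyspark-shell", as in the answer above.
    parts.append("pyspark-shell")
    return " ".join(parts)


# Must run before the SparkConf/SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(
    packages=["com.databricks:spark-csv_2.11:1.2.0"]
)
```

The environment variable is only read when the JVM is launched, so setting it after a context already exists in the same process has no effect.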
不要未来只要你来

2020-11-27 17:12
I finally found the answer after multiple tries. It is specific to the spark-csv jar. Create a folder on your hard drive, say D:\Spark\spark_jars, and place the following jars there:
1. spark-csv_2.10-1.4.0.jar (this is the version I am using)
2. commons-csv-1.1.jar
3. univocity-parsers-1.5.1.jar
Jars 2 and 3 are dependencies required by spark-csv, so those two files need to be downloaded too. Go to the conf directory of your Spark installation. In the spark-defaults.conf file, add the line:

spark.driver.extraClassPath D:/Spark/spark_jars/*

The asterisk should include all the jars in that folder. Now run Python and create the SparkContext and SQLContext as you normally would. You should then be able to use spark-csv as follows:
```
sqlContext.read.format('com.databricks.spark.csv').\
options(header='true', inferschema='true').\
load('foobar.csv')
```
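Before starting Spark, it can help to confirm that all three jars are actually in the folder, since a missing dependency typically only surfaces as a class-loading error at read time. This is a minimal sketch using only the standard library. The helper name `missing_jars` is hypothetical, and the folder and file names are the ones from this answer.

```python
import os
from typing import Iterable, List

# Jar names and folder taken from the answer above; adjust to your versions.
REQUIRED_JARS = [
    "spark-csv_2.10-1.4.0.jar",
    "commons-csv-1.1.jar",
    "univocity-parsers-1.5.1.jar",
]


def missing_jars(jar_dir: str, required: Iterable[str]) -> List[str]:
    """Return the names in `required` that are not present as files in `jar_dir`."""
    return [
        name for name in required
        if not os.path.isfile(os.path.join(jar_dir, name))
    ]


if __name__ == "__main__":
    # Prints an empty list when every jar is in place.
    print(missing_jars("D:/Spark/spark_jars", REQUIRED_JARS))
```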
野的像风

2020-11-27 17:15
There are many approaches here (setting ENV vars, adding to $SPARK_HOME/conf/spark-defaults.conf, etc.), and some of the answers already cover these. I wanted to add an answer for those specifically using Jupyter Notebooks and creating the Spark session from within the notebook. Here's the solution that worked best for me (in my case I wanted the Kafka package loaded):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0')\
    .getOrCreate()
```
Using this line of code I didn't need to do anything else (no ENVs or conf file changes).

2019-10-30 Update: The above line of code is still working great but I wanted to note a couple of things for new people seeing this answer:
- You'll need to change the version at the end to match your Spark version, so for Spark 2.4.4 you'll need: org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4
- The newer Scala 2.12 build of this jar, spark-sql-kafka-0-10_2.12, is crashing for me (Mac laptop), so if you get a crash when invoking 'readStream', revert to 2.11.
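The coordinate format in the notes above can be captured in a small helper, so the version is changed in one place. This is a minimal sketch. The function name `kafka_package` is hypothetical, and it assumes the `_2.11` / `_2.12` suffix is the Scala build and the trailing field is the Spark version, as in the coordinates shown in this answer.

```python
def kafka_package(spark_version: str, scala_version: str = "2.11") -> str:
    """Build the Maven coordinate for the spark-sql-kafka-0-10 package."""
    # Format: group:artifact_<scala version>:<spark version>
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"


if __name__ == "__main__":
    # For Spark 2.4.4 this prints the coordinate given in the first bullet.
    print(kafka_package("2.4.4"))
```

The returned string is what would be passed as the value of `spark.jars.packages` in the builder call.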

闹比i

2020-11-27 17:20

import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.4-src.zip'))

Here it comes....

sys.path.insert(0, <PATH TO YOUR JAR>)

Then...

import pyspark
import numpy as np

from pyspark import SparkContext

sc = SparkContext("local[1]")
.
.
.
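The py4j zip name differs between Spark releases (py4j-0.9 in the question, py4j-0.10.4 here), so hard-coding it tends to break on upgrade. This is a minimal sketch that looks the file up with a glob instead, assuming the `python/lib/py4j-*-src.zip` layout used in both snippets. The helper name `pyspark_sys_paths` is hypothetical.

```python
import glob
import os
import sys
from typing import List


def pyspark_sys_paths(spark_home: str) -> List[str]:
    """Return the entries to add to sys.path for a standalone Spark install."""
    python_dir = os.path.join(spark_home, "python")
    # Match whichever py4j version ships with this Spark release.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home is None:
        raise SystemExit("SPARK_HOME is not set")
    # Prepend the Spark Python sources and the py4j zip to the import path.
    sys.path[:0] = pyspark_sys_paths(spark_home)
```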
