Question
I have a PySpark job that depends on third-party libraries. I want to execute this code on my cluster, which runs under Mesos.
I have a zipped version of my Python environment sitting on an HTTP server reachable from the cluster.
I am having trouble telling my spark-submit command to use this environment.
I use --archives to load the zip file, together with --conf 'spark.pyspark.driver.python=path/to/my/env/bin/python'
and --conf 'spark.pyspark.python=path/to/my/env/bin/python'
to point Spark at the interpreter, roughly as in the command sketched below.
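For reference, a reconstruction of the full command (the Mesos master URL, archive URL, and script name are placeholders, not the asker's actual values):

spark-submit \
  --master mesos://master-host:5050 \
  --archives http://some-server/my_env.zip \
  --conf 'spark.pyspark.driver.python=path/to/my/env/bin/python' \
  --conf 'spark.pyspark.python=path/to/my/env/bin/python' \
  my_job.py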
This does not seem to work... Am I doing something wrong? Do you have any idea how to do this?
Cheers, Alex
Answer 1:
To submit your zip file to PySpark, you need to ship it with:
spark-submit --py-files your_zip your_code.py
To use it inside your code, add the zip at runtime and then import a module packaged inside it:
sc.addPyFile("your_zip")
import your_module  # a module inside the zip; the zip file itself is not importable by name
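A minimal self-contained sketch of the same idea (the zip name, module name, and transform function are placeholders, not anything from the original answer):

from pyspark import SparkContext

sc = SparkContext(appName="third-party-deps")

# Ship the zipped package to every executor and put it on their PYTHONPATH
sc.addPyFile("deps.zip")

# Import a top-level module packaged inside deps.zip (placeholder name)
import mymodule

# The module is also importable inside functions that run on the executors
result = sc.parallelize([1, 2, 3]).map(lambda x: mymodule.transform(x)).collect()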
Hope this will help!!
Answer 2:
This may be helpful to some people who have dependencies.
I found a solution for properly loading a virtual environment onto the driver and all the workers:
virtualenv venv --relocatable
cd venv
zip -qr ../venv.zip *
PYSPARK_PYTHON=./SP/bin/python spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./SP/bin/python \
  --driver-memory 4G \
  --archives venv.zip#SP \
  filename.py
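The command above targets YARN. On Mesos (the asker's setup), the same idea would use the generic spark.pyspark.* properties instead of the YARN-specific environment variable; a hedged sketch, assuming the Spark version in use actually extracts --archives on Mesos (historically the flag was honored only by the YARN backend, which may be why the original attempt failed):

spark-submit \
  --master mesos://master-host:5050 \
  --deploy-mode cluster \
  --archives venv.zip#SP \
  --conf spark.pyspark.driver.python=./SP/bin/python \
  --conf spark.pyspark.python=./SP/bin/python \
  filename.py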
Source: https://stackoverflow.com/questions/48644166/spark-submit-with-specific-python-librairies