I have two versions of Python. When I launch a Spark application using spark-submit, the application uses the default version of Python. But I want to use the other one. How can I do that?
You can set the PYSPARK_PYTHON variable in conf/spark-env.sh (in Spark's installation directory) to the absolute path of the desired Python executable.
By default, the Spark distribution contains spark-env.sh.template (spark-env.cmd.template on Windows), which must first be renamed to spark-env.sh (spark-env.cmd).
For example, if the Python executable is installed under /opt/anaconda3/bin/python3:
PYSPARK_PYTHON='/opt/anaconda3/bin/python3'
Check out the configuration documentation for more information.
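As a minimal sketch of the whole setup, assuming Spark is installed under /opt/spark and using a placeholder application my_app.py (adjust both to your own paths):
cd /opt/spark/conf
cp spark-env.sh.template spark-env.sh
echo "export PYSPARK_PYTHON=/opt/anaconda3/bin/python3" >> spark-env.sh
spark-submit my_app.py    # subsequent runs now pick up the configured interpreter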
You can either specify the version of Python by listing the path to your install in a shebang line in your script:
myfile.py:
#!/full/path/to/specific/python2.7
or by calling it on the command line without a shebang line in your script:
/full/path/to/specific/python2.7 myfile.py
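For example, with the hypothetical myfile.py above, the shebang approach looks like this:
chmod +x myfile.py     # make the script executable so the shebang is honoured
./myfile.py            # runs under /full/path/to/specific/python2.7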
However, I'd recommend looking into Python's excellent virtual environments, which allow you to create separate "environments" for each version of Python. Virtual environments more or less work by handling all the path specification after you activate them, allowing you to just type python myfile.py
without worrying about conflicting dependencies or knowing the full path to a specific version of Python.
There are plenty of good guides to getting started with virtual environments, and the official Python 3 documentation on the venv module covers them as well.
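A minimal sketch using the built-in venv module (the directory name spark_env is arbitrary):
python3 -m venv spark_env                          # create the environment
source spark_env/bin/activate                      # activate it; "python" now resolves inside spark_env
python myfile.py                                   # runs with the environment's interpreter
export PYSPARK_PYTHON="$VIRTUAL_ENV/bin/python"    # optional: point Spark at the same interpreter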
If you do not have access to the nodes and you're running this using PySpark, you can specify the Python version in your spark-env.sh:
Spark_Install_Dir/conf/spark-env.sh:
PYSPARK_PYTHON=/full/path/to/python_executable    # e.g. python2.7
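In Spark 2.1 and later there is also an equivalent configuration property, spark.pyspark.python, which can be passed on the command line instead of editing spark-env.sh; a sketch, with myfile.py as a placeholder application:
spark-submit --conf spark.pyspark.python=/full/path/to/python_executable myfile.py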
If you want to specify the option PYSPARK_MAJOR_PYTHON_VERSION on the spark-submit command line, you should check this:
http://spark.apache.org/docs/latest/running-on-kubernetes.html
You can search for spark.kubernetes.pyspark.pythonVersion on that page and you'll find the following content:
spark.kubernetes.pyspark.pythonVersion (default: "2"): This sets the major Python version of the docker image used to run the driver and executor containers. Can either be 2 or 3.
Now your command should look like:
spark-submit --conf spark.kubernetes.pyspark.pythonVersion=3 ...
It should work.
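A fuller sketch of such a Kubernetes submission (the master URL, container image, and application path below are placeholders you would replace with your own):
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<your-spark-py-image> \
  --conf spark.kubernetes.pyspark.pythonVersion=3 \
  local:///path/to/myfile.py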
In my environment, I simply used:
export PYSPARK_PYTHON=python2.7
It worked for me.
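Note that a bare python2.7 like this is resolved via PATH on the machines where the driver and executors run, and the export has to happen before spark-submit in the same shell session (or in your shell profile). A sketch, with myfile.py as a placeholder application:
export PYSPARK_PYTHON=python2.7
spark-submit myfile.py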