I created an Amazon EMR cluster with Spark already installed on it. When I ssh into my cluster and run pyspark from the terminal, it drops me into the pyspark shell.
I uploaded a
Add the following lines to ~/.bashrc.
For EMR 4.3:
export SPARK_HOME=/usr/lib/spark
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.XXX-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
Here py4j-0.XXX-src.zip is the py4j file in your Spark Python library folder. Look in /usr/lib/spark/python/lib/ to find the exact version and replace XXX with that version number.
Run source ~/.bashrc and you should be good.
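
If you want to sanity-check the setup, a minimal sketch along these lines should now run with plain python instead of the pyspark launcher (the file name check_pyspark.py and the app name are just placeholders, not part of the original answer):

# check_pyspark.py -- hypothetical file name; run it with: python check_pyspark.py
# If PYTHONPATH is set correctly, pyspark is importable outside the pyspark shell.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("pythonpath-check")  # placeholder app name
sc = SparkContext(conf=conf)

# Tiny job just to prove the context works end to end.
print(sc.parallelize(range(100)).sum())

sc.stop()

If the import fails, double-check that the py4j zip name in PYTHONPATH matches what is actually sitting in /usr/lib/spark/python/lib/.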