I created an Amazon EMR cluster with Spark already installed. When I ssh into my cluster and run pyspark from the terminal, it opens the pyspark shell.
I uploaded a
You probably need to add the PySpark directories to your Python path. I typically use a function like the following:
import os
import sys

def configure_spark(spark_home=None, pyspark_python=None):
    spark_home = spark_home or "/path/to/default/spark/home"
    os.environ['SPARK_HOME'] = spark_home

    # Add the PySpark directories to the Python path:
    sys.path.insert(1, os.path.join(spark_home, 'python'))
    sys.path.insert(1, os.path.join(spark_home, 'python', 'pyspark'))
    sys.path.insert(1, os.path.join(spark_home, 'python', 'build'))

    # If PySpark isn't specified, use the currently running Python binary:
    pyspark_python = pyspark_python or sys.executable
    os.environ['PYSPARK_PYTHON'] = pyspark_python
Then, you can call the function before importing pyspark:
configure_spark('/path/to/spark/home')
from pyspark import SparkContext
Spark home on an EMR node should be something like /home/hadoop/spark. See https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923 for more details.
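Putting it together, a minimal sketch of the whole flow on an EMR node might look like this (assuming Spark home really is /home/hadoop/spark on your cluster, and using a hypothetical app name; adjust both for your setup):

# Point the helper at the EMR Spark home, then import PySpark.
# /home/hadoop/spark is an assumption; verify it on your cluster first.
configure_spark('/home/hadoop/spark')

from pyspark import SparkContext

sc = SparkContext(appName="my-emr-job")  # hypothetical app name
print(sc.version)                        # quick sanity check that PySpark is wired up
sc.stop()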