ImportError: No module named numpy on spark workers

抹茶落季 2020-12-05 03:12

Launching pyspark in client mode: bin/pyspark --master yarn-client --num-executors 60. Importing numpy in the shell works fine, but it fails inside the KMeans job. Someho…

6 Answers
  • 2020-12-05 03:47

    To use Spark in Yarn client mode, you'll need to install any dependencies on the machines on which Yarn starts the executors. That's the only surefire way to make this work.

    Using Spark with Yarn cluster mode is a different story. You can distribute python dependencies with spark-submit.

    spark-submit --master yarn-cluster my_script.py --py-files my_dependency.zip
    

    However, the situation with numpy is complicated by the same thing that makes it so fast: the fact that it does the heavy lifting in C. Because of the way it is installed, you won't be able to distribute numpy in this fashion.
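    For pure-Python dependencies the same idea also works at runtime via SparkContext.addPyFile. Below is a minimal sketch, assuming a hypothetical archive named my_dependency.zip like the one in the spark-submit example above; this will not work for numpy, for the reason just given:

     # Sketch: distribute a pure-Python dependency to the executors at runtime.
     from pyspark import SparkContext

     sc = SparkContext.getOrCreate()
     sc.addPyFile("my_dependency.zip")    # added to every executor's PYTHONPATH

     def uses_dependency(x):
         import my_dependency             # hypothetical module, resolved from the shipped zip on the worker
         return (my_dependency.__name__, x)

     print(sc.parallelize([1, 2, 3]).map(uses_dependency).collect())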

  • 2020-12-05 03:48

    What solved it for me (on Mac) was actually this guide, which also explains how to run Python through Jupyter Notebook: https://medium.com/@yajieli/installing-spark-pyspark-on-mac-and-fix-of-some-common-errors-355a9050f735

    In a nutshell (assuming you installed Spark with brew install apache-spark):

    1. Find the SPARK_PATH using brew info apache-spark
    2. Add these lines to your ~/.bash_profile:
    # Spark and Python
    ######
    export SPARK_PATH=/usr/local/Cellar/apache-spark/2.4.1
    export PYSPARK_DRIVER_PYTHON="jupyter"
    export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
    # For Python 3, you have to add the line below or you will get an error
    export PYSPARK_PYTHON=python3
    alias snotebook='$SPARK_PATH/bin/pyspark --master local[2]'
    ######
    
    3. You should be able to open Jupyter Notebook simply by calling: pyspark

    And just remember that you don't need to create the SparkContext yourself; instead, simply call:

    from pyspark import SparkContext
    sc = SparkContext.getOrCreate()
    
  • 2020-12-05 03:51

    I had a similar issue, but I don't think you need to set PYSPARK_PYTHON; instead, just install numpy on the worker machines (apt-get or yum). The error will also tell you on which machine the import was missing.
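    If it is not obvious which workers are missing the module, a small probe job can point them out. This is only a sketch built on the standard PySpark API; it reports, per worker, the interpreter the executors run and whether numpy imports there:

     # Sketch: report which Python each executor runs and whether numpy imports.
     from pyspark import SparkContext

     sc = SparkContext.getOrCreate()

     def probe(_):
         # Imports happen on the executor, where the failure actually occurs.
         import socket, sys
         try:
             import numpy
             version = numpy.__version__
         except ImportError:
             version = "MISSING"
         yield (socket.gethostname(), sys.executable, version)

     report = (sc.parallelize(range(200), 200)   # enough partitions to reach every executor
                 .mapPartitions(probe)
                 .distinct()
                 .collect())
     for host, python, version in sorted(report):
         print(host, python, version)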

  • 2020-12-05 03:51

    You have to be aware that you need to have numpy installed on each and every worker, and even on the master itself (depending on your component placement).

    Also make sure to run the pip install numpy command from a root account (sudo does not suffice) after forcing the umask to 022 (umask 022), so that the resulting files are readable by the Spark (or Zeppelin) user.
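    To check from Python whether that is the problem on a given worker, a rough sketch (assuming numpy is already installed there) is to verify that the installed files are world-readable, which is exactly what umask 022 guarantees:

     # Rough sketch: list numpy files that other users (e.g. the Spark user)
     # cannot read; a root install with a restrictive umask produces such files.
     import os
     import stat

     import numpy

     pkg_dir = os.path.dirname(numpy.__file__)
     unreadable = []
     for root, _, files in os.walk(pkg_dir):
         for name in files:
             mode = os.stat(os.path.join(root, name)).st_mode
             if not mode & stat.S_IROTH:          # "others" have no read bit
                 unreadable.append(os.path.join(root, name))

     print("numpy files not readable by other users:", len(unreadable))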

  • 2020-12-05 04:01

    numpy is not installed on the worker (virtual) machines. If you use Anaconda, it is very convenient to ship such Python dependencies when deploying the application in cluster mode, so there is no need to install numpy or other modules on each machine; they only have to be present in your Anaconda environment. First, zip your Anaconda installation and put the zip file on the cluster (e.g. on HDFS), and then you can submit a job using the following script.

     spark-submit \
     --master yarn \
     --deploy-mode cluster \
     --archives hdfs://host/path/to/anaconda.zip#python-env \
     --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python-env/anaconda/bin/python \
     app_main.py
    

    Yarn will copy anaconda.zip from the HDFS path to each worker, unpack it under the alias python-env, and use python-env/anaconda/bin/python to execute the tasks.

    Refer to Running PySpark with Virtualenv for more information.
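    If you build the session from Python instead of calling spark-submit, the same flags can be expressed as configuration keys. This is only a rough sketch mirroring the command above; the HDFS path and the python-env alias are the same placeholders, and in client mode the driver still uses the local interpreter while the executors pick up the archived Python:

     # Rough sketch: the spark-submit flags above expressed as Spark config keys.
     from pyspark.sql import SparkSession

     spark = (
         SparkSession.builder
         .master("yarn")
         .config("spark.yarn.dist.archives", "hdfs://host/path/to/anaconda.zip#python-env")
         .config("spark.yarn.appMasterEnv.PYSPARK_PYTHON", "python-env/anaconda/bin/python")
         .config("spark.executorEnv.PYSPARK_PYTHON", "python-env/anaconda/bin/python")
         .getOrCreate()
     )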

  • 2020-12-05 04:11

    I had the same issue. Try installing numpy with pip3 if you're using Python 3:

    pip3 install numpy
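    As a quick sanity check (just a sketch), confirm that the interpreter your driver uses is the same Python 3 that pip3 installed numpy into; the workers need numpy under the same Python 3 as well:

     # Sketch: print the driver's interpreter and confirm numpy imports there.
     import sys
     print(sys.executable, sys.version_info[:2])

     import numpy
     print(numpy.__version__)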
