jupyter pyspark outputs: No module name sknn.mlp

℡╲_俬逩灬. Submitted on 2019-12-10 11:55:30

Question


I have a Spark HDInsight cluster with one worker node. I need to use the scikit-neuralnetwork and vaderSentiment modules in a PySpark Jupyter notebook.

I installed the libraries using the commands below:

cd /usr/bin/anaconda/bin/

export PATH=/usr/bin/anaconda/bin:$PATH

conda update matplotlib

conda install Theano

pip install scikit-neuralnetwork

pip install vaderSentiment

Next, I open the pyspark terminal, and I am able to import the modules successfully.

Now I open a Jupyter PySpark notebook, and the same import fails with "No module named sknn.mlp".

Just to add, I am able to import pre-installed modules from Jupyter, for example "import pandas".

The newly installed packages end up here:

admin123@hn0-linuxh:/usr/bin/anaconda/bin$ sudo find / -name "vaderSentiment"
/usr/bin/anaconda/lib/python2.7/site-packages/vaderSentiment
/usr/local/lib/python2.7/dist-packages/vaderSentiment

For pre-installed modules:

admin123@hn0-linuxh:/usr/bin/anaconda/bin$ sudo find / -name "pandas"
/usr/bin/anaconda/pkgs/pandas-0.17.1-np19py27_0/lib/python2.7/site-packages/pandas
/usr/bin/anaconda/pkgs/pandas-0.16.2-np19py27_0/lib/python2.7/site-packages/pandas
/usr/bin/anaconda/pkgs/bokeh-0.9.0-np19py27_0/Examples/bokeh/compat/pandas
/usr/bin/anaconda/Examples/bokeh/compat/pandas
/usr/bin/anaconda/lib/python2.7/site-packages/pandas

The sys.executable path is the same in both Jupyter and the terminal.

import sys
print(sys.executable)
/usr/bin/anaconda/bin/python

Any help would be greatly appreciated.


Answer 1:


The issue is that you are installing the library on the headnode (one of the VMs) but not on the other VMs (the worker nodes). When the PySpark app for Jupyter is created, it runs in YARN cluster mode, so the application master starts on an arbitrary worker node, where the library is missing.
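To confirm this on your own cluster, a small probe can report, per process, the hostname, the Python executable, and whether each module imports. This is only a sketch: `probe_modules` is a hypothetical helper name, and the Spark usage in the comments assumes the notebook already provides `sc`.

```python
import importlib
import socket
import sys


def probe_modules(names):
    """Report which of the given modules can be imported in this process."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = "ok"
        except Exception as exc:  # ImportError, or an error raised at import time
            status[name] = "failed: %s" % exc
    return {
        "host": socket.gethostname(),
        "python": sys.executable,
        "modules": status,
    }


if __name__ == "__main__":
    # Run locally, for example in the headnode terminal.
    print(probe_modules(["sknn.mlp", "vaderSentiment"]))

    # In the Jupyter PySpark notebook, where `sc` already exists, the same
    # function can be run on the driver and shipped to the executors
    # (8 partitions is an arbitrary choice for this sketch):
    #   print(probe_modules(["sknn.mlp", "vaderSentiment"]))
    #   sc.parallelize(range(8), 8).map(
    #       lambda _: probe_modules(["sknn.mlp", "vaderSentiment"])).collect()
```

If the hostname printed from the notebook differs from the headnode's and the modules show as failed there, the install only reached the headnode.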

One way of installing the libraries on all worker nodes would be to create a script action that runs against the worker nodes and installs the necessary libraries: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster-linux/

Do note that there are two Python installations in the cluster, and you have to refer to the Anaconda installation explicitly. Installing scikit-neuralnetwork would look something like this:

sudo /usr/bin/anaconda/bin/pip install scikit-neuralnetwork

The second way of doing this is to simply ssh into the worker nodes from the headnode. First, ssh into the headnode, then find the worker node IPs in Ambari at: https://YOURCLUSTER.azurehdinsight.net/#/main/hosts. Then, ssh 10.0.0.# and run the installation commands yourself on every worker node.
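If there are more than a couple of worker nodes, the ssh approach can be looped from the headnode. The following is a rough sketch, not a tested procedure: the IP addresses are placeholders, the helper names are made up for illustration, and it assumes ssh from the headnode to the workers and sudo both work for your user (they may prompt for a password).

```python
import subprocess

ANACONDA_PIP = "/usr/bin/anaconda/bin/pip"  # Anaconda pip path from the answer


def build_install_command(worker_ip, packages, pip_path=ANACONDA_PIP):
    """Return the argv list that installs the packages on one worker via ssh."""
    return ["ssh", worker_ip, "sudo", pip_path, "install"] + list(packages)


def install_on_workers(worker_ips, packages, dry_run=True):
    """Build (and, unless dry_run is set, run) the install command per worker."""
    commands = [build_install_command(ip, packages) for ip in worker_ips]
    for command in commands:
        if dry_run:
            # Only show what would be executed.
            print(" ".join(command))
        else:
            # Raises CalledProcessError if ssh or pip exits with an error.
            subprocess.check_call(command)
    return commands


if __name__ == "__main__":
    # Hypothetical addresses; the real ones are listed on the Ambari hosts page.
    workers = ["10.0.0.5", "10.0.0.6"]
    install_on_workers(workers, ["scikit-neuralnetwork", "vaderSentiment"])
```

With `dry_run=True` (the default here) nothing is executed, so you can check the commands before running them for real.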

I did this for scikit-neuralnetwork, and while it does import correctly, it throws an error saying it cannot create a file in ~/.theano. Because YARN runs PySpark sessions as the nobody user, Theano cannot create its config file. After a little digging, I found that there is a way to change where Theano writes and looks for its config file. Please also take care of that while doing the installation: http://deeplearning.net/software/theano/library/config.html#envvar-THEANORC

I forgot to mention: to modify an environment variable, you need to set it when creating the PySpark session. Execute this in the Jupyter notebook:

%%configure -f
{
    "conf": {
        "spark.executorEnv.THEANORC": "{YOURPATH}",
        "spark.yarn.appMasterEnv.THEANORC": "{YOURPATH}"
    }
}
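For illustration, here is a sketch of what could go behind {YOURPATH}. It assumes the failure comes from Theano's compile directory, which defaults to ~/.theano and can be moved with the `base_compiledir` option in the config file that THEANORC points to. The /tmp paths and the helper names are placeholders I made up, and the config file would have to exist at that path on every node.

```python
import json


def theanorc_text(compile_dir):
    """Contents of a Theano config file that relocates the compile directory."""
    return "[global]\nbase_compiledir = %s\n" % compile_dir


def configure_body(theanorc_path):
    """The JSON body for the %%configure cell, pointing THEANORC at the file."""
    return {
        "conf": {
            "spark.executorEnv.THEANORC": theanorc_path,
            "spark.yarn.appMasterEnv.THEANORC": theanorc_path,
        }
    }


if __name__ == "__main__":
    # Hypothetical locations; pick a directory the YARN user can write to.
    print(theanorc_text("/tmp/theano_compiledir"))
    print("%%configure -f")
    print(json.dumps(configure_body("/tmp/theanorc"), indent=4))
```

The first printed block is what you would save as the config file on each node; the second is the notebook cell with the path filled in.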

Thanks!




Answer 2:


An easy way to resolve this was:

  1. Create a bash script with the following contents:

    cd /usr/bin/anaconda/bin/

    export PATH=/usr/bin/anaconda/bin:$PATH

    conda update matplotlib

    conda install Theano

    pip install scikit-neuralnetwork

    pip install vaderSentiment

  2. Copy the bash script created above to any container in an Azure storage account.

  3. While creating the HDInsight Spark cluster, add a script action and give the script's location as the URL. Ex: https://sa-account-name.blob.core.windows.net/containername/path-of-installation-file.sh
  4. Run the script action on both the head nodes and the worker nodes.
  5. Now, open Jupyter and you should be able to import the modules.


Source: https://stackoverflow.com/questions/38479316/jupyter-pyspark-outputs-no-module-name-sknn-mlp
