Question
I have a Spark HDInsight cluster with one worker node. I need to use the scikit-neuralnetwork and vaderSentiment modules in a PySpark Jupyter notebook.
I installed the libraries using the commands below:
cd /usr/bin/anaconda/bin/
export PATH=/usr/bin/anaconda/bin:$PATH
conda update matplotlib
conda install Theano
pip install scikit-neuralnetwork
pip install vaderSentiment
Next, I opened the pyspark terminal and was able to import the modules successfully.
Now, when I open a Jupyter PySpark notebook and try the same import, it fails with "No module named sknn.mlp".
Just to add, I am able to import pre-installed modules from Jupyter, such as import pandas.
The packages were installed to:
admin123@hn0-linuxh:/usr/bin/anaconda/bin$ sudo find / -name "vaderSentiment"
/usr/bin/anaconda/lib/python2.7/site-packages/vaderSentiment
/usr/local/lib/python2.7/dist-packages/vaderSentiment
For pre-installed modules:
admin123@hn0-linuxh:/usr/bin/anaconda/bin$ sudo find / -name "pandas"
/usr/bin/anaconda/pkgs/pandas-0.17.1-np19py27_0/lib/python2.7/site-packages/pandas
/usr/bin/anaconda/pkgs/pandas-0.16.2-np19py27_0/lib/python2.7/site-packages/pandas
/usr/bin/anaconda/pkgs/bokeh-0.9.0-np19py27_0/Examples/bokeh/compat/pandas
/usr/bin/anaconda/Examples/bokeh/compat/pandas
/usr/bin/anaconda/lib/python2.7/site-packages/pandas
The sys.executable path is the same in both Jupyter and the terminal:
print(sys.executable)
/usr/bin/anaconda/bin/python
Any help would be greatly appreciated.
Answer 1:
The issue is that while you installed the packages on the headnode (one of the VMs), you did not install them on the other VMs (the worker nodes). When the PySpark application for Jupyter is created, it runs in YARN cluster mode, so the application master starts on a random worker node.
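A quick way to confirm this from the notebook is to attempt the import on the executors themselves. A minimal sketch, assuming the PySpark kernel exposes a SparkContext as sc (the helper name check_import is purely illustrative):

import socket

def check_import(_):
    # Try the import inside the executor process and report which host it ran on
    try:
        import vaderSentiment  # noqa: F401
        ok = True
    except ImportError:
        ok = False
    return [(socket.gethostname(), ok)]

# Spread a few partitions across the cluster so the check lands on the worker nodes
print(sc.parallelize(range(8), 8).mapPartitions(check_import).collect())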
One way to install the libraries on all worker nodes is to create a script action that runs against the worker nodes and installs the necessary libraries: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster-linux/
Do note that there are two Python installations on the cluster, and you have to refer to the Anaconda installation explicitly. Installing scikit-neuralnetwork would look something like this:
sudo /usr/bin/anaconda/bin/pip install scikit-neuralnetwork
The second way to do this is to simply ssh into the worker nodes from the headnode. First, ssh into the headnode, then find the worker node IPs in Ambari at https://YOURCLUSTER.azurehdinsight.net/#/main/hosts. Then ssh 10.0.0.#
and run the installation commands yourself on each worker node.
I did this for scikit-neuralnetwork, and while it does import correctly, it throws an error saying it cannot create a file in ~/.theano. Because YARN runs the PySpark sessions as the nobody
user, Theano cannot create its config file. After a bit of digging around, I found that there is a way to change where Theano writes/looks for its config file. Please take care of that during the installation as well: http://deeplearning.net/software/theano/library/config.html#envvar-THEANORC
Forgot to mention: to modify an environment variable, you need to set it when the PySpark session is created. Execute this in the Jupyter notebook:
%%configure -f
{
    "conf": {
        "spark.executorEnv.THEANORC": "{YOURPATH}",
        "spark.yarn.appMasterEnv.THEANORC": "{YOURPATH}"
    }
}
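A quick sanity check, assuming the Jupyter kernel's driver runs on the YARN application master (so the appMasterEnv value should be visible there) once the session has restarted:

import os

# Should print {YOURPATH} if the %%configure settings took effect
print(os.environ.get("THEANORC"))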
Thanks!
Answer 2:
The easy way to resolve this was:
Create a bash script with the installation commands:
cd /usr/bin/anaconda/bin/
export PATH=/usr/bin/anaconda/bin:$PATH
conda update matplotlib
conda install Theano
pip install scikit-neuralnetwork
pip install vaderSentiment
- Copy the bash script created above to a container in your Azure storage account.
- While creating the HDInsight Spark cluster, use a script action and provide the script's path as the URL. Ex: https://sa-account-name.blob.core.windows.net/containername/path-of-installation-file.sh
- Install it on both the head nodes and the worker nodes.
- Now, open Jupyter and you should be able to import the modules; see the quick check below.
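For example, a minimal check in a notebook cell (module names taken from the question; sknn.mlp is the import that originally failed):

# These imports should now succeed on whichever worker node hosts the session
import sknn.mlp        # scikit-neuralnetwork
import vaderSentiment
import pandas          # pre-installed module, as a baseline
print("imports OK")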
Source: https://stackoverflow.com/questions/38479316/jupyter-pyspark-outputs-no-module-name-sknn-mlp