Question
When trying to show a Spark DataFrame (Test), I get a KeyError, as shown below. Probably something goes wrong in the function I used before Test.show(3).
The KeyError says: KeyError: 'SPARK_HOME'. I assume SPARK_HOME is not defined on the master and/or the workers. Is there a way to set the SPARK_HOME directory automatically on both, preferably using an initialization action?
Py4JJavaErrorTraceback (most recent call last)
<ipython-input> in <module>()
----> 1 Test.show(3)

/usr/lib/spark/python/pyspark/sql/dataframe.py in show(self, n, truncate)
    255         +---+-----+
    256         """
--> 257         print(self._jdf.showString(n, truncate))
    258
    259     def __repr__(self):
...
raise KeyError(key)
KeyError: 'SPARK_HOME'
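The failure itself is just a dictionary lookup: PySpark reads the SPARK_HOME environment variable through os.environ, which raises KeyError for a missing key. As a minimal sketch (a hypothetical one-liner, not from the original post), the same error can be reproduced outside of Jupyter:
# Run the lookup with SPARK_HOME deliberately unset; os.environ[...]
# raises KeyError for missing variables, matching the traceback above.
env -u SPARK_HOME python -c 'import os; os.environ["SPARK_HOME"]'
# -> KeyError: 'SPARK_HOME'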
Answer 1:
You can simply put the following in an initialization action:
#!/bin/bash
# Append SPARK_HOME to the system-wide profile and bashrc files so that
# both login shells and interactive shells on every node pick it up.
cat << EOF | tee -a /etc/profile.d/custom_env.sh /etc/*bashrc >/dev/null
export SPARK_HOME=/usr/lib/spark/
EOF
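For the cluster to run it, the script has to be staged in a GCS bucket the cluster can read. A sketch of the upload, assuming a hypothetical local filename (the bucket matches the one used in the command below):
# Hypothetical filename; upload so Dataproc can fetch the script at
# cluster-creation time.
gsutil cp spark_home.sh gs://mybucket/spark_home.sh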
You'll want to list that init action before your Jupyter installation action, so that the variable is already present when the Jupyter process starts up.
Edit: To specify the two init actions, you can list them in a comma-separated list without spaces, like this:
gcloud dataproc clusters create \
--initialization-actions gs://mybucket/spark_home.sh,gs://mybucket/jupyter.sh ...
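Once the cluster is up, a quick sanity check is to echo the variable on the master over SSH. A sketch assuming a hypothetical cluster name (Dataproc names the master node <cluster-name>-m; depending on your gcloud config you may also need to pass --zone):
# Hypothetical cluster name "my-cluster"; its master node is "my-cluster-m".
gcloud compute ssh my-cluster-m --command 'echo "$SPARK_HOME"'
# Expected output: /usr/lib/spark/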
Source: https://stackoverflow.com/questions/38652940/keyerror-spark-home-in-pyspark-on-jupyter-on-google-cloud-dataproc