Question
In a Dataproc Spark cluster, the graphframes package is available in spark-shell but not in the Jupyter PySpark notebook.
PySpark kernel config:
PACKAGES_ARG='--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11'
The command used to initialize the cluster:
gcloud dataproc clusters create my-dataproc-cluster \
    --properties spark.jars.packages=com.databricks:graphframes:graphframes:0.2.0-spark2.0-s_2.11 \
    --metadata "JUPYTER_PORT=8124,INIT_ACTIONS_REPO=https://github.com/{xyz}/dataproc-initialization-actions.git" \
    --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh \
    --num-workers 2 \
    --properties spark:spark.executorEnv.PYTHONHASHSEED=0,spark:spark.yarn.am.memory=1024m \
    --worker-machine-type=n1-standard-4 \
    --master-machine-type=n1-standard-4
Answer 1:
This is an old bug with Spark Shells and YARN, that I thought was fixed in SPARK-15782, but apparently this case was missed.
The suggested workaround is adding
import os

# --packages downloads the jar into the driver's local Ivy cache, but in this
# YARN setup it never lands on the PySpark Python path; addPyFile distributes
# the jar to the executors and makes its Python sources importable.
sc.addPyFile(os.path.expanduser('~/.ivy2/jars/graphframes_graphframes-0.2.0-spark2.0-s_2.11.jar'))
before your import.
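For graphframes, "your import" would typically be the package's Python entry point (assuming the standard graphframes API):
from graphframes import GraphFrame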
Answer 2:
I found another way to add packages that works in a Jupyter notebook:
from pyspark.sql import SparkSession

# spark.jars.packages only takes effect if it is set before the underlying
# SparkContext is created, so stop any pre-existing session first.
spark = SparkSession.builder \
    .appName("Python Spark SQL") \
    .config("spark.jars.packages", "graphframes:graphframes:0.5.0-spark2.1-s_2.11") \
    .getOrCreate()
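As a quick sanity check (a minimal sketch, assuming the standard GraphFrame constructor; the tiny graph below is made up for illustration):
from graphframes import GraphFrame

# Vertices need an "id" column; edges need "src" and "dst" columns.
v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])
g = GraphFrame(v, e)
g.inDegrees.show()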
Answer 3:
If you can use EMR Notebooks, you can install additional Python libraries/dependencies using the install_pypi_package() API within the notebook. These dependencies (including transitive dependencies, if any) will be installed on all executor nodes.
More details here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-scoped-libraries.html
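A minimal sketch of that API in an EMR notebook cell (the package name is just illustrative; note that graphframes itself still needs its JVM jar on the classpath, so this mainly helps with Python-side dependencies):
sc.list_packages()                   # packages currently available on the cluster
sc.install_pypi_package("networkx")  # notebook-scoped install on all executors
sc.uninstall_package("networkx")     # remove it again when no longer needed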
Answer 4:
The simplest way to get Jupyter running with PySpark and graphframes is to launch Jupyter from pyspark with the additional package attached. Just open your terminal, set the two environment variables, and start pyspark with the graphframes package:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
The advantage of this approach is that if you later want to run your code via spark-submit, you can reuse the same --packages argument:
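For example (the script name here is a placeholder):
spark-submit --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 my_graph_job.py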
Source: https://stackoverflow.com/questions/40894739/dataproc-jupyter-pyspark-notebook-unable-to-import-graphframes-package