Question
In a Dataproc Spark cluster, the graphframes package is available in spark-shell but not in the Jupyter PySpark notebook.
PySpark kernel config:
PACKAGES_ARG='--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11'
The command used to initialize the cluster:
gcloud dataproc clusters create my-dataproc-cluster \
    --properties spark.jars.packages=com.databricks:graphframes:graphframes:0.2.0-spark2.0-s_2.11 \
    --metadata "JUPYTER_PORT=8124,INIT_ACTIONS_REPO=https://github.com/{xyz}/dataproc-initialization-actions.git" \
    --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh \
    --num-workers 2 \
    --properties spark:spark.executorEnv.PYTHONHASHSEED=0,spark:spark.yarn.am.memory=1024m \
    --worker-machine-type=n1-standard-4 \
    --master-machine-type=n1-standard-4
Answer 1:
This is an old bug with Spark Shells and YARN, that I thought was fixed in SPARK-15782, but apparently this case was missed.
The suggested workaround is adding
import os

# --packages downloads the jar into the driver's local Ivy cache, but in this
# YARN setup it never lands on the PySpark Python path; addPyFile distributes
# the jar to the executors and makes its Python sources importable.
sc.addPyFile(os.path.expanduser('~/.ivy2/jars/graphframes_graphframes-0.2.0-spark2.0-s_2.11.jar'))
before your import.
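For graphframes, "your import" would typically be the package's Python entry point (assuming the standard graphframes API):
from graphframes import GraphFrame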
Answer 2:
I found another way to add packages that works in a Jupyter notebook:
from pyspark.sql import SparkSession

# spark.jars.packages only takes effect if it is set before the underlying
# SparkContext is created, so stop any pre-existing session first.
spark = SparkSession.builder \
    .appName("Python Spark SQL") \
    .config("spark.jars.packages", "graphframes:graphframes:0.5.0-spark2.1-s_2.11") \
    .getOrCreate()
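As a quick sanity check (a minimal sketch, assuming the standard GraphFrame constructor; the tiny graph below is made up for illustration):
from graphframes import GraphFrame

# Vertices need an "id" column; edges need "src" and "dst" columns.
v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])
g = GraphFrame(v, e)
g.inDegrees.show()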
Answer 3:
If you can use EMR Notebooks, you can install additional Python libraries/dependencies using the install_pypi_package() API within the notebook. These dependencies (including transitive dependencies, if any) will be installed on all executor nodes.
More details here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-scoped-libraries.html
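A minimal sketch of that API in an EMR notebook cell (the package name is just illustrative; note that graphframes itself still needs its JVM jar on the classpath, so this mainly helps with Python-side dependencies):
sc.list_packages()                   # packages currently available on the cluster
sc.install_pypi_package("networkx")  # notebook-scoped install on all executors
sc.uninstall_package("networkx")     # remove it again when no longer needed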
Answer 4:
The simplest way to get Jupyter running with PySpark and graphframes is to launch Jupyter from pyspark with the additional package attached. Just open your terminal, set the two environment variables, and start pyspark with the graphframes package:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
The advantage of this approach is that if you later want to run your code via spark-submit, you can reuse the same --packages argument:
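For example (the script name here is a placeholder):
spark-submit --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 my_graph_job.py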
Source: https://stackoverflow.com/questions/40894739/dataproc-jupyter-pyspark-notebook-unable-to-import-graphframes-package