Dataproc: Jupyter pyspark notebook unable to import graphframes package

Submitted by 时光毁灭记忆、已成空白 on 2019-12-11 07:58:33

Question


In a Dataproc Spark cluster, the graphframes package is available in spark-shell but not in the Jupyter PySpark notebook.

PySpark kernel config:

PACKAGES_ARG='--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11'

The following is the command used to initialize the cluster:

gcloud dataproc clusters create my-dataproc-cluster \
    --properties spark.jars.packages=com.databricks:graphframes:graphframes:0.2.0-spark2.0-s_2.11 \
    --metadata "JUPYTER_PORT=8124,INIT_ACTIONS_REPO=https://github.com/{xyz}/dataproc-initialization-actions.git" \
    --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh \
    --num-workers 2 \
    --properties spark:spark.executorEnv.PYTHONHASHSEED=0,spark:spark.yarn.am.memory=1024m \
    --worker-machine-type=n1-standard-4 \
    --master-machine-type=n1-standard-4

Answer 1:


This is an old bug with Spark shells and YARN that I thought was fixed in SPARK-15782, but apparently this case was missed.

The suggested workaround is adding

import os

# Make the Ivy-downloaded GraphFrames jar visible to the Python interpreter;
# the jar also bundles the graphframes Python package.
sc.addPyFile(os.path.expanduser('~/.ivy2/jars/graphframes_graphframes-0.2.0-spark2.0-s_2.11.jar'))

before your import.
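
With the jar on the Python path, the import should then resolve. As a quick smoke test, here is a minimal sketch with toy data invented for illustration, assuming the notebook's SparkSession is bound to spark as usual:

from graphframes import GraphFrame

# Tiny illustrative graph, just to confirm the package now loads and runs.
v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = spark.createDataFrame([("a", "b", "knows")], ["src", "dst", "relationship"])
GraphFrame(v, e).inDegrees.show()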




Answer 2:


I found another way to add packages that works in a Jupyter notebook:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Python Spark SQL") \
    .config("spark.jars.packages", "graphframes:graphframes:0.5.0-spark2.1-s_2.11") \
    .getOrCreate()
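
Note that spark.jars.packages is only honored when it is set before the JVM starts, so this assumes no SparkContext already exists in the kernel when getOrCreate() runs. Once the session is up, the import should resolve:

from graphframes import GraphFrame  # resolvable after the package is fetched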



Answer 3:


If you can use EMR Notebooks, you can install additional Python libraries/dependencies with the install_pypi_package() API from within the notebook. These dependencies (including transitive dependencies, if any) will be installed on all executor nodes.

More details here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-scoped-libraries.html
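
Based on the linked documentation, a minimal sketch of what this looks like in a notebook cell; note that the PyPI graphframes package supplies only the Python wrapper, so the JVM-side jar may still need to be provided separately (e.g. via spark.jars.packages):

# In an EMR Notebook cell; `sc` is the SparkContext provided by the kernel.
sc.install_pypi_package("graphframes")  # installs on all executor nodes
sc.list_packages()                      # lists what is now available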




Answer 4:


The simplest way to start Jupyter with PySpark and GraphFrames is to launch Jupyter from pyspark with the additional package attached.

Just open your terminal, set the two environment variables, and start pyspark with the graphframes package:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
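
Inside the notebook this opens, spark and sc already exist with the package attached, so no addPyFile step is needed; for example (toy data invented for illustration):

# `spark` is created by the pyspark shell that launched this notebook.
from graphframes import GraphFrame

v = spark.createDataFrame([("1", "u1"), ("2", "u2")], ["id", "name"])
e = spark.createDataFrame([("1", "2", "follows")], ["src", "dst", "relationship"])
g = GraphFrame(v, e)
g.pageRank(resetProbability=0.15, maxIter=1).vertices.show()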

An added advantage is that if you later want to run your code via spark-submit, you can reuse the same --packages argument in the start command.



Source: https://stackoverflow.com/questions/40894739/dataproc-jupyter-pyspark-notebook-unable-to-import-graphframes-package
