Question
I am new to PySpark and EMR.
I am trying to access Spark running on an EMR cluster from a Jupyter notebook on my local machine, but I am running into errors.
I create the SparkSession with the following code:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
.master("local[*]")\
.appName("Carbon - SingleWell parallelization on Spark")\
.getOrCreate()
I tried the following to access the remote cluster, but it errored out:
spark = SparkSession.builder \
.master("spark://<remote-emr-ec2-hostname>:7077")\
.appName("Carbon - SingleWell parallelization on Spark")\
.getOrCreate()
Error:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NullPointerException
at org.apache.spark.SparkContext.<init>(SparkContext.scala:567)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
Any help resolving this would be much appreciated.
Answer 1:
EMR clusters come with Jupyter and JupyterHub provisioned for you since EMR release 5.14.0.
Note also that Spark on EMR runs on YARN rather than as a standalone cluster, so nothing is listening on the standalone master port 7077, which is why the spark://<host>:7077 master URL fails.
Most likely, it is easier to customize those provisioned services with some extra bootstrap actions than to wire up your local process to talk to the EMR master node.
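To illustrate the suggestion above, here is a minimal sketch (not part of the original answer) of launching a cluster with JupyterHub preinstalled plus a custom bootstrap action, using boto3. The bucket name, script path, cluster name, and instance settings are illustrative placeholders; the actual AWS call is left commented out since it requires credentials.

```python
# Sketch: launch an EMR cluster with Spark + JupyterHub and a custom
# bootstrap action. All names below (cluster name, bucket, script path,
# instance types/counts) are hypothetical placeholders.
request = {
    "Name": "carbon-spark-cluster",
    "ReleaseLabel": "emr-5.14.0",  # JupyterHub ships with EMR >= 5.14.0
    "Applications": [{"Name": "Spark"}, {"Name": "JupyterHub"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m4.large", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m4.large", "InstanceCount": 2},
        ],
        # Keep the cluster running after any submitted steps finish,
        # so the notebook stays usable.
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    # Bootstrap actions run on every node before applications start;
    # this is where extra Python libraries etc. would be installed.
    "BootstrapActions": [
        {"Name": "install-extra-libs",
         "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap.sh"}},
    ],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# Requires boto3 and configured AWS credentials:
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# response = emr.run_job_flow(**request)
```

The JupyterHub endpoint then runs on the master node itself, so no local-to-cluster Spark wiring is needed.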
Source: https://stackoverflow.com/questions/44800857/jupyter-emr-spark-connect-to-emr-cluster-from-jupyter-notebook-on-local-ma