I am trying to use multiprocessing to read 100 CSV files in parallel (and then process each of them in parallel as well). Here is my code, running in a Jupyter notebook hosted on my EMR cluster.
On a closer read of the SparkSession.Builder API docs, the string passed to SparkSession.builder.master('xxxx')
is the host used when connecting to a standalone master at spark://xxxx:7077.
As user8371915 said, I should not have been pointing at a standalone local master. Instead, this fix worked like a charm:
SparkSession.builder.master('yarn')
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.Builder.html#master-java.lang.String-
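For context, here is a minimal sketch of the fix in use. It assumes a Jupyter session on an EMR master node with PySpark available; the app name and the S3 glob path are hypothetical placeholders, and it needs a live YARN cluster to actually run:

```python
from pyspark.sql import SparkSession

# Submit to the cluster's YARN resource manager instead of a
# standalone spark://host:7077 master.
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("read-many-csvs")  # hypothetical app name
    .getOrCreate()
)

# Spark parallelizes the read across executors on its own, so no
# multiprocessing is needed: a glob pattern reads all files at once.
df = spark.read.csv("s3://my-bucket/data/*.csv",  # hypothetical path
                    header=True, inferSchema=True)
```

Since Spark distributes both the read and any subsequent transformations across the cluster, this also replaces the original multiprocessing approach entirely.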