How to Distribute Multiprocessing Pool to Spark Workers

忘了有多久 · 2021-01-21 05:02

I am trying to use multiprocessing to read 100 CSV files in parallel (and subsequently process them separately in parallel). Here is my code, running in a Jupyter notebook hosted on my EMR cluster:

1 Answer
  • 2021-01-21 05:50

    On a closer read of the SparkSession.Builder API docs, the string passed to SparkSession.builder.master('xxxx') is the host used to connect to the master node, i.e. spark://xxxx:7077. As user8371915 said, I needed to stop running against a standalone local master. Instead, this fix worked like a charm:

    SparkSession.builder.master('yarn')
    

    https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.Builder.html#master-java.lang.String-
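    To make the fix concrete, here is a minimal sketch of building a YARN-backed session on EMR and letting Spark read the CSVs in parallel instead of a multiprocessing pool. The app name and S3 path are placeholders, not taken from the original post, and this requires a live YARN cluster to actually run:

    ```python
    # Sketch of the accepted fix, assuming an EMR cluster managed by YARN.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("yarn")          # submit to YARN instead of a standalone/local master
        .appName("csv-ingest")   # hypothetical app name
        .getOrCreate()
    )

    # Spark splits the file glob across executors on the worker nodes itself,
    # so no multiprocessing.Pool is needed on the driver.
    df = spark.read.csv("s3://my-bucket/data/*.csv", header=True)  # hypothetical path
    ```

    With master('yarn'), the work is distributed to the cluster's worker nodes by the YARN resource manager, which is what a multiprocessing pool on the driver cannot do.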
