I am trying to use multiprocessing to read 100 CSV files in parallel (and then process each of them in parallel as well). Here is my code, running in a Jupyter notebook hosted on my EMR cluster.
On a closer read of the SparkSession.Builder API docs, the string passed to SparkSession.builder.master('xxxx')
is the host used when connecting to a standalone master at spark://xxxx:7077.
As user8371915 said, I should not have been pointing at a standalone local master. Instead, this fix worked like a charm:
SparkSession.builder.master('yarn')
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.Builder.html#master-java.lang.String-
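For context, here is a minimal sketch of the fix in use. It assumes a Jupyter session on an EMR master node with PySpark available; the app name and the S3 glob path are hypothetical placeholders, and it needs a live YARN cluster to actually run:

```python
from pyspark.sql import SparkSession

# Submit to the cluster's YARN resource manager instead of a
# standalone spark://host:7077 master.
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("read-many-csvs")  # hypothetical app name
    .getOrCreate()
)

# Spark parallelizes the read across executors on its own, so no
# multiprocessing is needed: a glob pattern reads all files at once.
df = spark.read.csv("s3://my-bucket/data/*.csv",  # hypothetical path
                    header=True, inferSchema=True)
```

Since Spark distributes both the read and any subsequent transformations across the cluster, this also replaces the original multiprocessing approach entirely.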