Spark Container & Executor OOMs during `reduceByKey`

故里飘歌 2021-02-09 23:04

I'm running a Spark job with pyspark on Amazon EMR, in client mode on YARN, to process data from two input files totaling 200 GB in size.

The job joins the data from the two files and then aggregates it with `reduceByKey`; during that stage, containers and executors are killed with out-of-memory errors.
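Roughly, the failing stage follows this shape (a minimal sketch, not the actual job; the paths, parser, and aggregation are hypothetical):

```python
from pyspark import SparkConf, SparkContext

# Minimal sketch of the failing pattern: combine the two inputs as RDDs,
# then aggregate with reduceByKey. Paths and the parse function are
# hypothetical placeholders.
sc = SparkContext(conf=SparkConf().setAppName("oom-repro"))

rdd_a = sc.textFile("s3://my-bucket/input_a/")  # hypothetical path
rdd_b = sc.textFile("s3://my-bucket/input_b/")  # hypothetical path

def parse(line):
    # Hypothetical parser: turn each line into a (key, value) pair.
    fields = line.split("\t")
    return fields[0], 1

pairs = rdd_a.union(rdd_b).map(parse)

# The shuffle behind reduceByKey sends every record for a given key to a
# single task; if a handful of keys dominate the data, those tasks (and
# their executors) exceed their memory limits and YARN kills the containers.
counts = pairs.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("s3://my-bucket/output/")  # hypothetical path
```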

1 Answer
  • 2021-02-09 23:39

    In case anyone else comes across this, the problem turned out to be data skew. I discovered it by switching the initial combining of the two input files from an RDD union to a DataFrame join. That produced a more understandable error, which showed that the shuffle was failing while trying to retrieve data. To solve it, I repartitioned the data on an evenly distributed key, and everything worked.
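    As a rough sketch of that fix (the paths, join key, partition count, and aggregation below are hypothetical, not taken from the original job):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal sketch of the fix described above: combine the inputs with a
# DataFrame join instead of an RDD union, then repartition on an evenly
# distributed key before the shuffle-heavy aggregation.
spark = SparkSession.builder.appName("skew-fix").getOrCreate()

df_a = spark.read.parquet("s3://my-bucket/input_a/")  # hypothetical path
df_b = spark.read.parquet("s3://my-bucket/input_b/")  # hypothetical path

joined = df_a.join(df_b, on="record_id", how="inner")  # hypothetical join key

# Repartition on a key whose values are spread evenly across rows, so no
# single partition (and therefore no single executor) receives a
# disproportionate share of the data.
balanced = joined.repartition(400, "evenly_distributed_key")  # hypothetical column

result = (
    balanced
    .groupBy("evenly_distributed_key")
    .agg(F.count("*").alias("cnt"))
)
result.write.mode("overwrite").parquet("s3://my-bucket/output/")  # hypothetical path
```

    The DataFrame join also has the practical advantage that its errors and the Spark UI stage breakdown make skewed shuffle partitions much easier to spot than the opaque container kills from the RDD version.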
