I'm running a Spark job on Amazon EMR in client mode with YARN, using PySpark, to process data from two input files totaling 200 GB in size.
The job joins the data from the two input files.
In case anyone else runs into this: the problem turned out to be data skew. I discovered this by switching our initial combining of the two input files from an RDD union to a DataFrame join, which produced a more understandable error showing that the shuffle failed while trying to fetch data. To fix it, I repartitioned our data on an evenly distributed key, and everything worked.