Spark Container & Executor OOMs during `reduceByKey`

故里飘歌 2021-02-09 23:04

I'm running a PySpark job on Amazon EMR, in client mode with YARN, to process data from two input files totaling 200 GB.

The job joins the dat

1 answer
  •  夕颜 (original poster)
    2021-02-09 23:39

    In case anyone finds this later: the problem turned out to be data skew. I discovered this by switching the initial combination of the two input files from an RDD union to a DataFrame join. That produced a more understandable error, which showed that the shuffle was failing while trying to fetch data. To fix it, I partitioned the data on an evenly distributed key, and then everything worked.
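
To illustrate why the choice of partitioning key matters, here is a minimal pure-Python sketch (no Spark required) that counts how many records land in each partition under hash partitioning. The records, the key names, the partition count, and the use of `zlib.crc32` as a stand-in for Spark's partitioner hash are all assumptions made for illustration; this is not the code from the job above.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Return the number of records that hash into each partition."""
    sizes = Counter()
    for key in keys:
        # crc32 is a deterministic stand-in for a partitioner's hash function.
        sizes[zlib.crc32(str(key).encode()) % num_partitions] += 1
    return [sizes[p] for p in range(num_partitions)]


# Hypothetical records of the form (record_id, join_key).
# One join_key ("hot") accounts for 90% of the rows, which models data skew.
records = [(i, "hot" if i < 9000 else f"k{i}") for i in range(10000)]

# Partitioning on the skewed key sends every "hot" row to the same partition.
by_join_key = partition_sizes((key for _, key in records), 8)

# Partitioning on an evenly distributed key spreads the rows out.
by_record_id = partition_sizes((rid for rid, _ in records), 8)

print("largest partition, skewed key:", max(by_join_key))
print("largest partition, even key:  ", max(by_record_id))
```

In PySpark the analogous step would be something like `df.repartition(n, "some_even_column")`, where the column name is a placeholder. Whether that is enough depends on the job: an aggregation keyed on the skewed column will still shuffle all rows for the hot key to one reducer.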
