I'm running a Spark job on Amazon EMR in client mode with YARN, using PySpark, to process data from two input files totaling 200 GB in size.
The job joins the data from the two input files.
In case anyone else runs into this: the problem turned out to be data skew. I discovered this by switching our initial combining of the two input files from an RDD union to a DataFrame join, which produced a more understandable error showing that the shuffle failed while trying to fetch data. To fix it, I repartitioned our data on an evenly distributed key, and everything worked.