I'm doing a simple groupBy on a fairly small dataset (80 files in HDFS, a few gigs in total). I'm running Spark on 8 low-memory machines in a YARN cluster, i.e. something along
Patrick Wendell shed some light on the details of the groupBy operator on the mailing list. The takeaway message is the following:
Within a partition things will spill [...] This spilling can only occur across keys at the moment. Spilling cannot occur within a key at present. [...] Spilling within one key for GroupBy's is likely to end up in the next release of Spark, Spark 1.2. [...] If the goal is literally to just write out to disk all the values associated with each group, and the values associated with a single group are larger than fit in memory, this cannot be accomplished right now with the groupBy operator.
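For concreteness, the pattern the quote describes is roughly the following. This is a minimal sketch of the problematic job, not the asker's actual code; the input path and the tab-separated key/value parsing are assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits (needed before Spark 1.3)

val sc = new SparkContext(new SparkConf().setAppName("naive-groupBy"))

// Assumed tab-separated (key, value) records.
val pairs = sc.textFile("hdfs:///input/*").map { line =>
  val Array(k, v) = line.split("\t", 2)
  (k, v)
}

// groupByKey must materialize all values of a single key in memory:
// spilling happens across keys, not within one key, so one very large
// group can still cause an out-of-memory error.
pairs.groupByKey()
  .map { case (k, vs) => s"$k\t${vs.mkString(",")}" }
  .saveAsTextFile("hdfs:///output/naive")
```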
He further suggests a work-around:
The best way to work around this depends a bit on what you are trying to do with the data downstream. Typically approaches involve sub-dividing any very large groups, for instance, appending a hashed value in a small range (1-10) to large keys. Then your downstream code has to deal with aggregating partial values for each group. If your goal is just to lay each group out sequentially on disk in one big file, you can call sortByKey with a hashed suffix as well. The sort functions are externalized in Spark 1.1 (which is in pre-release).
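Here is a minimal sketch of that salting workaround. The salt range, the hashing choice, the input path, and the record format are my assumptions, not part of the quoted advice:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD / ordered-RDD implicits (needed before Spark 1.3)
import scala.util.hashing.MurmurHash3

val sc = new SparkContext(new SparkConf().setAppName("salted-groupBy"))

// Assumed tab-separated (key, value) records.
val pairs = sc.textFile("hdfs:///input/*").map { line =>
  val Array(k, v) = line.split("\t", 2)
  (k, v)
}

val saltBuckets = 10   // the "small range (1-10)" from the quote

// Append a hashed suffix to the key so a single huge group is split
// into at most `saltBuckets` smaller groups.
val salted = pairs.map { case (k, v) =>
  val salt = ((MurmurHash3.stringHash(v) % saltBuckets) + saltBuckets) % saltBuckets
  ((k, salt), v)
}

// Option 1: group on the salted key; downstream code must then merge
// the partial groups that share the same original key.
val partialGroups = salted.groupByKey()

// Option 2: lay each group out sequentially on disk by sorting on the
// salted key instead (sorting is external as of Spark 1.1).
salted.sortByKey()
  .map { case ((k, _), v) => s"$k\t$v" }
  .saveAsTextFile("hdfs:///output/grouped")
```

The trade-off is that grouping no longer yields one record per original key: any consumer has to be aware of the salt and combine the partial groups itself, which is acceptable when the downstream step is an aggregation or a sequential write, but not when it genuinely needs all values of a key in memory at once.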