Question
With Spark 2.3 I'm running the following code:
import org.apache.spark.storage.StorageLevel.DISK_ONLY

rdd
  .persist(DISK_ONLY) // this is 3 GB according to the storage tab
  .groupBy(_.key)
  .mapValues(iter => iter.map(x => CaseClass(x._1, x._2)))
  .mapValues(iter => func(iter))
- I have a SQL DataFrame of 300M rows
- I convert it to an RDD, then persist it: the storage tab indicates it's 3 GB
- I do a groupBy. One of my keys is receiving 100M items, so roughly 1 GB if I go by the RDD size
- I map each item after the shuffle to a case class. This case class only has two Double fields
- I'm sending the full iterator containing all of a partition's data to a function that will process this stream (see the sketch after this list)
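For reference, here is a minimal sketch of the definitions the snippet above relies on (Input, CaseClass, func, and the table name are simplified placeholders, not my exact code):

import org.apache.spark.sql.SparkSession

// Simplified stand-ins for my real types
case class Input(key: String, v1: Double, v2: Double)  // one row of the DataFrame
case class CaseClass(a: Double, b: Double)             // only two Double fields

// func receives the full iterator for one key and processes it as a stream
def func(iter: Iterable[CaseClass]): Double =
  iter.iterator.map(c => c.a * c.b).sum                // stand-in for the real processing

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// the ~300M-row SQL DataFrame, converted to an RDD before the groupBy
val rdd = spark.table("my_table").as[Input].rdd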
What I observe is that the task processing the 100M case class instances always fails after 1h+ of processing. In the "aggregated metrics by executor" tab of the UI I see HUGE values in the "shuffle spill" column, around 10 GB, which is 3 times the size of the full RDD. When I take a thread dump of the slow executor, it seems stuck in disk read/write operations.
Can somebody tell me what's going on? I understand that 100M case class instances is probably too much to fit into a single executor's RAM, but I don't understand the following:
1) Isn't Spark supposed to "stream" all the instances into my func function? Why is it trying to store everything on the receiving executor node?
2) Where does the memory blow-up come from? I don't understand why serializing 100M case class instances should take around 10 GB, which is roughly 100 bytes per item (assuming the data spilled to disk is the CaseClass instances; I'm not sure at which point in my job the data is spilled).
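As a rough sanity check on that arithmetic, I was thinking of estimating the on-heap size of a single instance with Spark's SizeEstimator (a minimal sketch; note it measures the deserialized heap footprint, which is not necessarily what ends up on disk during a spill):

import org.apache.spark.util.SizeEstimator

// Placeholder mirroring my 2-field case class
case class CaseClass(a: Double, b: Double)

// Estimated on-heap (deserialized) size of one instance, in bytes
val perItem = SizeEstimator.estimate(CaseClass(1.0, 2.0))
println(s"~$perItem bytes per instance on the heap")

// Rough extrapolation to the hot key's ~100M items (heap size, not spill size)
println(s"~${perItem * 100000000L / (1L << 30)} GB for 100M instances")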
Source: https://stackoverflow.com/questions/53622577/understanding-huge-shuffle-spill-sizes-in-spark