Question
I have the following code. Essentially I need to print an RDD to the console, so I am collecting a large RDD in smaller chunks by collecting it one partition at a time, to avoid collecting the entire RDD at once. When monitoring the heap and the GC log, it seems like nothing is ever being GC'd; the heap keeps growing until it hits an OutOfMemory error. If my understanding is correct, once the println statement below has executed for a collected partition, that data is no longer needed and should be safe to GC, but that is not what I see in the GC log: the data from each call to collect accumulates until OOM. Does anyone know why the collected data is not being GC'd?
val writes = partitions.foreach { partition =>
  // Keep only the elements of the current partition (empty iterator for the
  // rest) and pull them to the driver with collect().
  val rddPartition = rdds.mapPartitionsWithIndex({
    case (index, data) => if (index == partition.index) data else Iterator[Words]()
  }, false).collect().toSeq
  val partialReport = Report(rddPartition, reportId, dateCreated)
  println(partialReport.name)
}
Answer 1:
If your dataset is huge, most probably the master (driver) node can't handle it and will shut down. You may try writing the partitions to files (e.g. with saveAsTextFile), then read each file back again.
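A minimal sketch of that idea, assuming local mode (so the output directory is on the driver's filesystem), a stand-in RDD[String], and a hypothetical output path; the part files are then streamed back line by line instead of being collected into driver memory:

import org.apache.spark.{SparkConf, SparkContext}

object WriteThenRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("write-then-read").setMaster("local[*]"))
    val rdds = sc.parallelize(1 to 1000000).map(_.toString) // stand-in for the real RDD

    // Write each partition to its own part-XXXXX file instead of collecting.
    val outputDir = "/tmp/report-chunks" // hypothetical path
    rdds.saveAsTextFile(outputDir)

    // Read the part files back and print them; scala.io.Source streams line by
    // line, so only a small amount of data is on the driver at any time.
    new java.io.File(outputDir).listFiles()
      .filter(_.getName.startsWith("part-"))
      .foreach { file =>
        val source = scala.io.Source.fromFile(file)
        try source.getLines().foreach(println)
        finally source.close()
      }

    sc.stop()
  }
}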
Answer 2:
collect() creates an array containing all the elements of the RDD. It cannot be garbage collected before it is fully created! Hence the OOM.
There can be a way around it, depending on what Report is actually doing.
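One possible workaround (not spelled out in the answer, so treat it as a sketch that only applies if Report just needs to print elements) is to avoid building a per-partition array on the driver at all and instead stream elements with RDD.toLocalIterator, which fetches one partition at a time:

import org.apache.spark.{SparkConf, SparkContext}

object PrintViaLocalIterator {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("print-via-local-iterator").setMaster("local[*]"))
    val rdds = sc.parallelize(1 to 1000000).map(i => s"word-$i") // stand-in data

    // toLocalIterator pulls only one partition's worth of data to the driver
    // at a time; once the iterator advances past a partition, that partition's
    // data becomes unreachable and can be garbage collected.
    rdds.toLocalIterator.foreach(println)

    sc.stop()
  }
}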
Source: https://stackoverflow.com/questions/35046692/spark-incremental-collect-to-a-partition-causes-outofmemory-in-heap