Question
I have the following code. Essentially I need to print an RDD to the console, so I am collecting a large RDD in smaller chunks by collecting it one partition at a time, to avoid collecting the entire RDD at once. When monitoring the heap and the GC log, it seems like nothing is ever being GC'd; the heap keeps growing until it hits an OutOfMemory error. If my understanding is correct, once the println statement below has executed for a collected partition, that data is no longer needed and should be safe to GC, but that is not what I see in the GC log: the data from each call to collect accumulates until OOM. Does anyone know why the collected data is not being GC'd?
val writes = partitions.foreach { partition =>
  // Keep only the elements of the current partition (empty iterator for the
  // rest) and pull them to the driver with collect().
  val rddPartition = rdds.mapPartitionsWithIndex({
    case (index, data) => if (index == partition.index) data else Iterator[Words]()
  }, false).collect().toSeq
  val partialReport = Report(rddPartition, reportId, dateCreated)
  println(partialReport.name)
}
Answer 1:
If your dataset is huge, most probably the master (driver) node can't handle it and will shut down. You may try writing the partitions to files (e.g. with saveAsTextFile), then read each file back again.
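A minimal sketch of that idea, assuming local mode (so the output directory is on the driver's filesystem), a stand-in RDD[String], and a hypothetical output path; the part files are then streamed back line by line instead of being collected into driver memory:

import org.apache.spark.{SparkConf, SparkContext}

object WriteThenRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("write-then-read").setMaster("local[*]"))
    val rdds = sc.parallelize(1 to 1000000).map(_.toString) // stand-in for the real RDD

    // Write each partition to its own part-XXXXX file instead of collecting.
    val outputDir = "/tmp/report-chunks" // hypothetical path
    rdds.saveAsTextFile(outputDir)

    // Read the part files back and print them; scala.io.Source streams line by
    // line, so only a small amount of data is on the driver at any time.
    new java.io.File(outputDir).listFiles()
      .filter(_.getName.startsWith("part-"))
      .foreach { file =>
        val source = scala.io.Source.fromFile(file)
        try source.getLines().foreach(println)
        finally source.close()
      }

    sc.stop()
  }
}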
Answer 2:
collect() creates an array containing all the elements of the RDD. It cannot be garbage collected before it is fully created! Hence the OOM.
There can be a way around it, depending on what Report is actually doing.
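One possible workaround (not spelled out in the answer, so treat it as a sketch that only applies if Report just needs to print elements) is to avoid building a per-partition array on the driver at all and instead stream elements with RDD.toLocalIterator, which fetches one partition at a time:

import org.apache.spark.{SparkConf, SparkContext}

object PrintViaLocalIterator {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("print-via-local-iterator").setMaster("local[*]"))
    val rdds = sc.parallelize(1 to 1000000).map(i => s"word-$i") // stand-in data

    // toLocalIterator pulls only one partition's worth of data to the driver
    // at a time; once the iterator advances past a partition, that partition's
    // data becomes unreachable and can be garbage collected.
    rdds.toLocalIterator.foreach(println)

    sc.stop()
  }
}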
Source: https://stackoverflow.com/questions/35046692/spark-incremental-collect-to-a-partition-causes-outofmemory-in-heap