Spark: Incremental collect() to a partition causes OutOfMemory in Heap

Submitted by 岁酱吖の on 2020-01-03 02:10:55

Question


I have the following code. Essentially I need to print an RDD to the console, so I am collecting a large RDD in smaller chunks by collecting it one partition at a time. This is to avoid collecting the entire RDD at once. When monitoring the heap and the GC log, it seems like nothing is ever being GC'd: the heap keeps growing until it hits an OutOfMemory error. If my understanding is correct, once the println statement below has executed for a collected chunk, that chunk is no longer needed and should be safe to GC, but that's not what I see in the GC log; the data from each call to collect accumulates until the OOM. Does anyone know why the collected data is not being GC'd?

    val writes = partitions.foreach { partition =>
      // Collect only the rows belonging to this partition onto the driver
      val rddPartition = rdds.mapPartitionsWithIndex({
        case (index, data) => if (index == partition.index) data else Iterator[Words]()
      }, preservesPartitioning = false).collect().toSeq
      // Build a report from this partition's data and print its name
      val partialReport = Report(rddPartition, reportId, dateCreated)
      println(partialReport.name)
    }

Answer 1:


If your dataset is huge, most probably the master node can't handle it and will shut down. You may try writing the data to files (e.g. with saveAsTextFile) and then reading each file back again.
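A minimal sketch of this approach, assuming a string RDD and an output directory such as "/tmp/report-output" (both illustrative, not taken from the original question):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("report").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical RDD of report lines; stands in for the asker's `rdds`
    val lines = sc.parallelize(Seq("line-1", "line-2", "line-3"), numSlices = 3)

    // Write the data out from the executors instead of collecting it on the driver
    val outputDir = "/tmp/report-output"
    lines.saveAsTextFile(outputDir)

    // Read the files back and stream them to the console one partition at a time,
    // so the driver never holds the whole dataset in memory
    sc.textFile(outputDir).toLocalIterator.foreach(println)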




Answer 2:


collect() creates an array containing all the elements of the RDD.

It cannot be garbage collected before it is fully created! Hence the OOM.

There may be a way around it, depending on what Report is actually doing.
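If Report only needs to see the elements once, one possible workaround (a sketch, not the asker's actual code) is to stream the RDD to the driver with toLocalIterator, which fetches a single partition at a time rather than building one array per collect():

    // Pull the data to the driver one partition at a time; at most one
    // partition's worth of elements is held in driver memory at any moment.
    rdds.toLocalIterator.foreach { words =>
      println(words)
    }

Note that toLocalIterator still materialises each partition on the driver, so a single oversized partition can still cause an OOM.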



Source: https://stackoverflow.com/questions/35046692/spark-incremental-collect-to-a-partition-causes-outofmemory-in-heap
