Scala Spark, ListBuffer is empty

假如想象 submitted on 2019-11-26 12:32:53

Apache Spark doesn't provide shared memory, so here:

dataSet.foreach { e =>
  items += e
  println("len = " + items.length) // the length grows here, inside the task
}

you modify a local copy of items on the respective executor. The original items list defined on the driver is never modified. As a result, this:

items.foreach { x => print(x) }

executes, but there is nothing to print.

Please check Understanding closures in the Spark programming guide.
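
To see the whole effect in one self-contained snippet, here is a minimal sketch (the local SparkSession setup and the sample data are illustrative, not from the question):

import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("closure-demo")
  .master("local[2]")
  .getOrCreate()

val items = new ListBuffer[String]()
val dataSet = spark.sparkContext.parallelize(Seq("a", "b", "c"))

// each task receives its own deserialized copy of items,
// so these appends never touch the driver's instance
dataSet.foreach { e => items += e }

println(items.length) // prints 0 on the driver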

While it wouldn't be the recommended approach here, you could replace items with an accumulator:

// a collection accumulator merges values added on the executors back to the driver
val acc = sc.collectionAccumulator[String]("Items")
dataSet.foreach(e => acc.add(e))
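
For completeness, here is a hedged end-to-end sketch of the accumulator approach. The SparkSession setup and the sample data are assumptions for illustration; collectionAccumulator and acc.value are the actual Spark APIs:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("accumulator-demo")
  .master("local[2]")
  .getOrCreate()
val sc = spark.sparkContext

val dataSet = sc.parallelize(Seq("a", "b", "c"))

// the accumulator is registered on the driver; each task adds to its
// local copy and Spark merges the results back when the task finishes
val acc = sc.collectionAccumulator[String]("Items")
dataSet.foreach(e => acc.add(e))

// acc.value (a java.util.List) is read on the driver after the action
// completes; element order across tasks is not guaranteed
println(acc.value)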

Spark runs the foreach in executors and returns the results. The above code doesn't work as intended. If you need to add the elements from foreach, you need to collect the data on the driver and then add it to current_set. But collecting the data is a bad idea when you have large data.

import scala.collection.mutable.ListBuffer

val items = new ListBuffer[String]()

val rdd = spark.sparkContext.parallelize(1 to 10, 4)
// collect() brings every partition back to the driver, so the
// appends below run locally and are visible in items
rdd.collect().foreach(data => items += data.toString())
println(items)

Output:

ListBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
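
If the data is too large to collect, and you only need to iterate over the elements on the driver rather than hold them all at once, RDD.toLocalIterator is one alternative. A minimal sketch, again assuming a SparkSession named spark:

val bigRdd = spark.sparkContext.parallelize(1 to 10, 4)

// toLocalIterator pulls one partition to the driver at a time,
// so driver memory use stays bounded by the largest partition
bigRdd.toLocalIterator.foreach(data => println(data))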