broadcast variable fails to take all data

别来无恙 posted on 2019-12-24 11:25:49

Question


When I broadcast the result of collectAsMap(), the broadcast variable does not contain all the values. For example:

    // Key: fourth tab-separated field; value: second field. Each line is split only once.
    val emp = sc.textFile("...text1.txt").map(_.split("\t")).map(f => (f(3), f(1))).distinct()
    val emp_new = sc.textFile("...text2.txt").map(_.split("\t")).map(f => (f(3), f(1))).distinct()
    emp_new.foreach(println)

    // Collect the pairs to the driver as a map and broadcast it.
    val emp_newBC = sc.broadcast(emp_new.collectAsMap())
    println(emp_newBC.value)

When I checked the values in emp_newBC, I saw that not all the data from emp_new appears. What am I missing?

Thanks in advance.


Answer 1:


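The problem is that emp_new is an RDD of (key, value) tuples, while emp_newBC holds a broadcast map. A map keeps only one value per key, so when collectAsMap() meets several tuples with the same key, all but one of them are dropped and you end up with less data. Note that distinct() only removes identical tuples; tuples that share a key but differ in value remain in emp_new. If you want all the tuples back as an array, use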

    val emp_newBC = sc.broadcast(emp_new.collect())
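
If you still need lookups by key, one possible alternative is to group the values per key before collecting, so that collectAsMap() has nothing to overwrite. The following is only a minimal sketch under stated assumptions: the object name BroadcastAllValues, the helper groupedMap, the sample pairs and the local master are all made up for illustration and are not part of the original question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; the names are illustrative only.
object BroadcastAllValues {

  // Group all values that share a key, then collect to the driver as a map.
  // Each key now appears once, so collectAsMap() cannot drop any value.
  def groupedMap(pairs: RDD[(String, String)]): scala.collection.Map[String, Set[String]] =
    pairs.groupByKey().mapValues(_.toSet).collectAsMap()

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, used here only to keep the sketch self-contained.
    val conf = new SparkConf().setAppName("broadcast-all-values").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Made-up sample data: key "d1" occurs with two different values.
    val pairs = sc.parallelize(Seq(("d1", "alice"), ("d1", "bob"), ("d2", "carol")))

    // collect() keeps every tuple.
    println(pairs.collect().length)      // 3

    // collectAsMap() keeps one value per key, so one "d1" tuple is lost.
    println(pairs.collectAsMap().size)   // 2

    // Grouping first keeps both values for "d1" under a single key.
    val groupedBC = sc.broadcast(groupedMap(pairs))
    println(groupedBC.value("d1").size)  // 2

    sc.stop()
  }
}
```

Which of the values survives in a plain collectAsMap() should not be relied on, so grouping (or collect()) is the safer choice whenever keys can repeat.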



Source: https://stackoverflow.com/questions/32691591/broadcast-variable-fails-to-take-all-data
