Question
When I create a broadcast variable with collectAsMap(), not all of the values end up in the broadcast variable. For example:
val emp = sc.textFile("...text1.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
val emp_new = sc.textFile("...text2.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
emp_new.foreach(println)
val emp_newBC = sc.broadcast(emp_new.collectAsMap())
println(emp_newBC.value)
When I checked the values in emp_newBC, I saw that not all of the data from emp_new appears. What am I missing?
Thanks in advance.
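Answer 1:
The problem is that emp_new is an RDD of tuples, while emp_newBC holds a broadcast map. A map keeps only one value per key, so when you call collectAsMap(), tuples that share a key collapse into a single entry and you end up with fewer elements than in emp_new. If you want to broadcast all of the tuples, use collect() instead, which returns them as an array: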
val emp_newBC = sc.broadcast(emp_new.collect())
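To see why entries go missing, here is a minimal sketch that uses plain Scala collections as a stand-in for the RDD, so it runs without a Spark cluster. The object name DuplicateKeys and the sample pairs are made up for illustration and are not from the original data. It shows the same effect (one value per key survives when pairs become a map) and one possible way to keep every value by grouping per key first.

```scala
// Plain Scala collections stand in for the RDD here; the sample pairs are hypothetical.
object DuplicateKeys {
  val pairs: Seq[(String, String)] = Seq(
    ("dept1", "alice"),
    ("dept1", "bob"),   // same key as the previous pair
    ("dept2", "carol")
  )

  // Converting pairs to a Map keeps one value per key, so one pair is lost.
  val asMap: Map[String, String] = pairs.toMap

  // Grouping the values per key first preserves every value.
  val grouped: Map[String, Seq[String]] =
    pairs.groupBy(_._1).map { case (key, kvs) => (key, kvs.map(_._2)) }

  def main(args: Array[String]): Unit = {
    println(s"pairs: ${pairs.size}, map entries: ${asMap.size}")
    println(grouped)
  }
}
```

If a key-based lookup is still needed on the Spark side, a similar approach would be emp_new.groupByKey().collectAsMap(), which maps each key to all of its values; whether that fits depends on how the broadcast value is used afterwards.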
Source: https://stackoverflow.com/questions/32691591/broadcast-variable-fails-to-take-all-data