Spark: passing broadcast variable to executors

Submitted by 心不动则不痛 on 2020-01-05 03:10:51

Question


I am passing a broadcast variable to all my executors using the following code. The code seems to work, but I don't know if my approach is good enough. I just want to see if anyone has better suggestions. Thank you very much!

val myRddMap = sc.textFile("input.txt").map(t => myParser.parse(t))
val myHashMapBroadcastVar = sc.broadcast(myRddMap.collect().toMap)

where myRddMap is of type org.apache.spark.rdd.RDD[(String, (String, String))].

Then I have a utility function to which I pass RDDs and the broadcast variable, like:

val myOutput = myUtiltityFunction.process(myRDD1, myHashMapBroadcastVar)

So is the above code a good way to handle broadcast variables? Or is there a better approach? Thanks!


Answer 1:


Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

Broadcast variables are actually sent to all nodes, so it doesn't matter whether you use them in a utility function or anywhere else. As far as I can tell you are doing the right thing; nothing here looks like it would cause poor performance.
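To make the pattern concrete, here is a minimal sketch of what such a utility function might look like. It assumes the function receives the Broadcast handle and calls .value inside the closure that runs on the executors; the object name MyUtilityFunction and the lookup logic are hypothetical, not from the original post:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object MyUtilityFunction {
  // The Broadcast handle itself is tiny and serializes with the task
  // closure; .value reads the map from the executor-local cache, so the
  // full map is shipped once per executor rather than once per task.
  def process(rdd: RDD[String],
              lookup: Broadcast[Map[String, (String, String)]]): RDD[(String, String)] =
    rdd.flatMap(key => lookup.value.get(key))
}
```

The key point is to pass the Broadcast[...] handle into the function and defer the .value call until inside the RDD transformation; calling collect().toMap on the driver first (as in the question) is fine as long as the map comfortably fits in driver and executor memory.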



Source: https://stackoverflow.com/questions/31033367/spark-passing-broadcast-variable-to-executors
