Question
I am passing a broadcast variable to all my executors using the following code. The code seems to work, but I don't know whether my approach is good enough. I just want to see if anyone has better suggestions. Thank you very much!
val myRddMap = sc.textFile("input.txt").map(t => myParser.parse(t))
val myHashMapBroadcastVar = sc.broadcast(myRddMap.collect().toMap)
where myRddMap is of type org.apache.spark.rdd.RDD[(String, (String, String))].
Then I have a utility function to which I pass RDDs and broadcast variables, like:
val myOutput = myUtiltityFunction.process(myRDD1, myHashMapBroadcastVar)
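For context, the utility function presumably reads the broadcast value on the executors via .value inside a transformation. A minimal sketch of what that might look like (the method body here is an assumption; the question does not show the real implementation):

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object myUtiltityFunction {
  // Hypothetical body: looks each record's key up in the broadcast map.
  def process(rdd: RDD[String],
              bcMap: Broadcast[Map[String, (String, String)]]): RDD[(String, (String, String))] = {
    rdd.flatMap { key =>
      // bcMap.value is read on the executor; the map is cached there once,
      // not re-sent with every task.
      bcMap.value.get(key).map(v => (key, v))
    }
  }
}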
So, is the above code a good way to handle broadcast variables? Or is there a better approach? Thanks!
Answer 1:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
Broadcast variables are indeed shipped to the executors, once per executor rather than once per task, so it doesn't matter whether you read them inside a utility function or anywhere else in your job. As far as I can tell, you are doing the right thing; nothing here looks like it would cause poor performance.
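To illustrate the standard pattern end to end, here is a minimal, self-contained sketch (not the asker's actual code; names like BroadcastDemo are made up for the example):

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-demo").setMaster("local[*]"))

    // A small lookup table built on the driver.
    val lookup = Map("a" -> 1, "b" -> 2)

    // Cached read-only on each executor instead of being shipped with every task.
    val bc = sc.broadcast(lookup)

    val result = sc.parallelize(Seq("a", "b", "c"))
      .map(k => (k, bc.value.getOrElse(k, 0))) // executors read it via .value
      .collect()

    result.foreach(println)

    bc.unpersist() // optionally drop the cached executor copies when done
    sc.stop()
  }
}

The key point is that the closure only captures the small Broadcast handle; the underlying map travels to each executor once and is reused by all tasks running there.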
Source: https://stackoverflow.com/questions/31033367/spark-passing-broadcast-variable-to-executors