Question
I work with graphs in GraphX. Using the code below, I created a variable that stores the neighbors of each node in an RDD:
val all_neighbors: VertexRDD[Array[VertexId]] = graph.collectNeighborIds(EdgeDirection.Either)
I then collected the neighbors and broadcast them to all workers using the code below:
val broadcastVar = all_neighbors.collect().toMap
val nvalues = sc.broadcast(broadcastVar)
I want to compute the intersection between the neighbors of two nodes, for example between the neighbors of node 1 and the neighbors of node 2.
At first I used this code, which relies on the broadcast variable nvalues:
val common_neighbors=nvalues.value(1).intersect(nvalues.value(2))
I also tried the code below to compute the intersection for the two nodes by filtering the RDD:
val common_neighbors2=(all_neighbors.filter(x=>x._1==1)).intersection(all_neighbors.filter(x=>x._1==2))
My question is: which of the two methods above is more efficient, and which is more distributed and parallel? Using the broadcast variable nvalues to compute the intersection, or filtering the RDD?
Answer 1:
I think it depends on the situation.
If nvalues is small enough to fit in memory on the driver and on each executor, the broadcast approach will be the better choice: the data is cached on the executors and is not recomputed over and over again, which also saves Spark a large communication and compute burden. In that case the other approach is not optimal, because the all_neighbors RDD may be recomputed every time it is used, and those recomputations increase the computation cost and hurt performance.
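To make the broadcast case concrete, here is a minimal sketch that is not taken from the question. It looks the neighbor arrays up inside a distributed map, which is where a broadcast variable is actually read on the executors. The toy edge list, the candidate pairs, the local master, and the helper name commonNeighbors are all assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, EdgeDirection, Graph, VertexId}

object BroadcastNeighbors {
  // Pure helper (hypothetical name): intersect the neighbor arrays of two nodes.
  // Nodes missing from the map are treated as having no neighbors.
  def commonNeighbors(neighbors: Map[VertexId, Array[VertexId]],
                      a: VertexId, b: VertexId): Array[VertexId] = {
    val na = neighbors.getOrElse(a, Array.empty[VertexId])
    val nb = neighbors.getOrElse(b, Array.empty[VertexId])
    na.intersect(nb)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, only so the sketch can run on one machine.
    val conf = new SparkConf().setAppName("broadcast-neighbors").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Assumption: a small toy graph standing in for the real one.
    val edges = sc.parallelize(Seq(
      Edge(1L, 3L, 1), Edge(1L, 4L, 1),
      Edge(2L, 3L, 1), Edge(2L, 4L, 1), Edge(2L, 5L, 1)))
    val graph = Graph.fromEdges(edges, 0)

    // Same steps as in the question: collect the neighbor lists, then broadcast them.
    val neighborMap = graph.collectNeighborIds(EdgeDirection.Either).collect().toMap
    val nvalues = sc.broadcast(neighborMap)

    // Assumption: the node pairs to compare are themselves distributed as an RDD.
    val pairs = sc.parallelize(Seq((1L, 2L), (1L, 3L)))

    // Each task reads the broadcast map locally; no shuffle is needed.
    val common = pairs.map { case (a, b) =>
      ((a, b), commonNeighbors(nvalues.value, a, b))
    }

    common.collect().foreach { case (pair, ids) =>
      println(s"$pair -> ${ids.sorted.mkString(", ")}")
    }
    sc.stop()
  }
}
```

With only a single pair, as in the question, the intersection runs on the driver either way, so the broadcast pays off mainly when many pairs are processed in parallel.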
If nvalues cannot fit in memory on the driver and on each executor, broadcasting will not work, as it will throw an error. In that case there is no option left but the second approach; it might still cause performance issues, but at least the code will work!
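For the RDD-only case, here is a rough sketch under some assumptions. RDD.intersection compares whole elements, so intersecting two filtered (id, neighbors) records that carry different ids would come back empty; the sketch therefore flattens each node's array into an RDD of ids before intersecting. Calling cache() on the neighbor RDD is one way to limit the recomputation mentioned above. The toy graph, the local master, and the helper names neighborsOf and commonNeighbors are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, EdgeDirection, Graph, VertexId, VertexRDD}
import org.apache.spark.rdd.RDD

object RddNeighborIntersection {
  // Hypothetical helper: turn one node's neighbor array into an RDD of ids.
  def neighborsOf(all: VertexRDD[Array[VertexId]], id: VertexId): RDD[VertexId] =
    all.filter { case (vid, _) => vid == id }
       .flatMap { case (_, ids) => ids.toSeq }

  // Builds a toy graph (an assumption) and intersects the neighbors of a and b.
  def commonNeighbors(sc: SparkContext, a: VertexId, b: VertexId): Array[VertexId] = {
    val edges = sc.parallelize(Seq(
      Edge(1L, 3L, 1), Edge(1L, 4L, 1),
      Edge(2L, 3L, 1), Edge(2L, 4L, 1), Edge(2L, 5L, 1)))
    val graph = Graph.fromEdges(edges, 0)

    // cache() keeps the neighbor lists in memory so both filters reuse them
    // instead of recomputing collectNeighborIds.
    val allNeighbors = graph.collectNeighborIds(EdgeDirection.Either).cache()

    // intersection on RDD[VertexId] compares ids; it involves a shuffle.
    neighborsOf(allNeighbors, a)
      .intersection(neighborsOf(allNeighbors, b))
      .collect()
      .sorted
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, only so the sketch can run on one machine.
    val conf = new SparkConf().setAppName("rdd-neighbors").setMaster("local[*]")
    val sc = new SparkContext(conf)
    println(commonNeighbors(sc, 1L, 2L).mkString(", "))
    sc.stop()
  }
}
```

This keeps everything distributed, at the price of two filters and a shuffle for each pair of nodes.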
Let me know if it helps!!
Source: https://stackoverflow.com/questions/60493554/comparing-intersection-between-two-nodes-using-broadcast-variable-and-using-rdd