efficiently calculating connected components in pyspark
问题 I'm trying to find the connected components for friends in a city. My data is a list of edges with an attribute of city. City | SRC | DEST Houston Kyle -> Benny Houston Benny -> Charles Houston Charles -> Denny Omaha Carol -> Brian etc. I know the connectedComponents function of pyspark's GraphX library will iterate over all the edges of a graph to find the connected components and I'd like to avoid that. How would I do so? edit: I thought I could do something like select connected_components