How to find membership of vertices using GraphFrames or igraph or networkx in PySpark

强颜欢笑 submitted on 2019-12-02 12:06:39

I'm not sure if you are violating any rules here by asking the same question two times.

To detect communities with GraphFrames, you first have to create a GraphFrame object. Given your example dataframe, the following code snippet shows the necessary transformations:

from graphframes import GraphFrame

#connectedComponents() needs a checkpoint directory to be set
sc.setCheckpointDir("/tmp/connectedComponents")


l = [
(  '600060'  , '09283744'),
(  '600131'  , '96733110'),
(  '600194'  , '01700001')
]

columns = ['valx', 'valy']

#this is your input dataframe 
edges = spark.createDataFrame(l, columns)

#graphframes requires two dataframes: an edge and a vertex dataframe.
#the edge dataframe has to have at least two columns labeled with src and dst.
edges = edges.withColumnRenamed('valx', 'src').withColumnRenamed('valy', 'dst')
edges.show()

#the vertex dataframe requires at least one column labeled with id
#(append .distinct() if a value can occur in more than one edge, so the ids stay unique)
vertices = edges.select('src').union(edges.select('dst')).withColumnRenamed('src', 'id')
vertices.show()

g = GraphFrame(vertices, edges)

Output:

+------+--------+ 
|   src|     dst| 
+------+--------+ 
|600060|09283744| 
|600131|96733110| 
|600194|01700001| 
+------+--------+ 
+--------+ 
|      id| 
+--------+ 
|  600060| 
|  600131| 
|  600194| 
|09283744| 
|96733110| 
|01700001| 
+--------+

You wrote in the comments of your other question that the community detection algorithm doesn't currently matter to you. Therefore I will pick connected components:

result = g.connectedComponents()
result.show()

Output:

+--------+------------+ 
|      id|   component| 
+--------+------------+ 
|  600060|163208757248| 
|  600131| 34359738368| 
|  600194|884763262976| 
|09283744|163208757248| 
|96733110| 34359738368| 
|01700001|884763262976| 
+--------+------------+
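If you want to sanity-check the membership on a small sample without Spark, a plain-Python union-find gives the same grouping. This is only a sketch for cross-checking: the function name `connected_components` is my own, and the component labels it returns (the smallest vertex id in each group) will not match the numeric ids that GraphFrames assigns. Only the grouping itself is comparable.

```python
def connected_components(edges):
    """Return {vertex: representative vertex} for an undirected edge list."""
    parent = {}

    def find(v):
        # Register unseen vertices as their own root.
        parent.setdefault(v, v)
        # Walk up to the root, shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            # Use the smaller id as representative so labels are deterministic.
            parent[max(root_src, root_dst)] = min(root_src, root_dst)

    return {v: find(v) for v in parent}


# Same edges as in the example dataframe above.
edges = [
    ("600060", "09283744"),
    ("600131", "96733110"),
    ("600194", "01700001"),
]

membership = connected_components(edges)
for vertex, component in sorted(membership.items()):
    print(vertex, component)
```

Vertices that share a component value here should also share a component value in the GraphFrames result.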

Other community detection algorithms (like LPA, label propagation) can be found in the GraphFrames user guide.
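As a rough sketch of how LPA could be swapped in: GraphFrames exposes it as `labelPropagation(maxIter=...)`, which returns the vertices with an added `label` column. The helper name `label_propagation_membership` and the default of 5 iterations are my own assumptions, and the helper collects everything to the driver, so it is only reasonable for small graphs.

```python
def label_propagation_membership(g, max_iter=5):
    """Run labelPropagation on a GraphFrame and return {label: sorted vertex ids}.

    Assumes g is a graphframes.GraphFrame. Collects to the driver,
    so use it on small graphs only.
    """
    # labelPropagation returns the vertex dataframe plus a 'label' column.
    result = g.labelPropagation(maxIter=max_iter)

    membership = {}
    for row in result.select("id", "label").collect():
        membership.setdefault(row["label"], []).append(row["id"])

    return {label: sorted(ids) for label, ids in membership.items()}


# Usage with the GraphFrame g built above (needs a running Spark session):
# communities = label_propagation_membership(g, max_iter=5)
# for label, members in communities.items():
#     print(label, members)
```

Keep in mind that LPA is not deterministic in general and is not guaranteed to converge, so the labels can differ between runs.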
