How to find membership of vertices using GraphFrames, igraph, or networkx in PySpark

Submitted by 放肆的年华 on 2019-12-25 01:49:01

Question


My input dataframe is df:

    valx      valy 
1: 600060     09283744
2: 600131     96733110 
3: 600194     01700001

I want to build a graph that treats the two columns above as an edge list. The output should list every vertex of the graph together with its membership.

I have tried GraphFrames in PySpark and the networkx library too, but I am not getting the desired results.

My output should look like the following (basically, all valx and valy values under V1 as vertices, and their membership info under V2):

V1               V2
600060           1
96733110         1
01700001         3

I tried the following:

import networkx as nx
import pandas as pd

filelocation = r'Pathtodataframe df csv'

Panda_edgelist = pd.read_csv(filelocation)

g = nx.from_pandas_edgelist(Panda_edgelist,'valx','valy')
g2 = g.to_undirected(g)
list(g.nodes)

Answer 1:


I'm not sure whether you are violating any rules here by asking the same question twice.

To detect communities with GraphFrames, you first have to create a GraphFrame object. Given your example dataframe, the following code snippet shows the necessary transformations:

from graphframes import GraphFrame

#connectedComponents() needs a checkpoint directory; sc is the active SparkContext
sc.setCheckpointDir("/tmp/connectedComponents")


l = [
(  '600060'  , '09283744'),
(  '600131'  , '96733110'),
(  '600194'  , '01700001')
]

columns = ['valx', 'valy']

#this is your input dataframe 
edges = spark.createDataFrame(l, columns)

#GraphFrames requires two dataframes: an edge dataframe and a vertex dataframe.
#the edge dataframe must have at least two columns, named src and dst.
edges = edges.withColumnRenamed('valx', 'src').withColumnRenamed('valy', 'dst')
edges.show()

#the vertex dataframe requires at least one column, named id
vertices = edges.select('src').union(edges.select('dst')).withColumnRenamed('src', 'id')
vertices.show()

g = GraphFrame(vertices, edges)

Output:

+------+--------+ 
|   src|     dst| 
+------+--------+ 
|600060|09283744| 
|600131|96733110| 
|600194|01700001| 
+------+--------+ 
+--------+ 
|      id| 
+--------+ 
|  600060| 
|  600131| 
|  600194| 
|09283744| 
|96733110| 
|01700001| 
+--------+

You wrote in the comments of your other question that the choice of community detection algorithm doesn't matter to you at the moment. Therefore I will use connected components:

result = g.connectedComponents()
result.show()

Output:

+--------+------------+ 
|      id|   component| 
+--------+------------+ 
|  600060|163208757248| 
|  600131| 34359738368| 
|  600194|884763262976| 
|09283744|163208757248| 
|96733110| 34359738368| 
|01700001|884763262976| 
+--------+------------+
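
The component values above are arbitrary long identifiers, not consecutive group numbers like the ones in the question. To illustrate what connected components computes, and one way such IDs could be relabeled as 1, 2, 3, here is a minimal pure-Python sketch that runs union-find over the same three edges. It does not use GraphFrames or Spark; the function name component_membership is hypothetical, and numbering components by first appearance is an assumption about the desired output.

```python
def component_membership(edges):
    """Map each vertex to a 1-based connected-component number."""
    parent = {}

    def find(v):
        # Register unseen vertices as their own root.
        parent.setdefault(v, v)
        # Walk up to the root, shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_dst] = root_src  # merge the two components

    # Number components in order of first appearance (an assumption).
    labels = {}
    membership = {}
    for v in list(parent):
        root = find(v)
        labels.setdefault(root, len(labels) + 1)
        membership[v] = labels[root]
    return membership


# Vertex IDs are kept as strings so leading zeros survive.
edges = [
    ("600060", "09283744"),
    ("600131", "96733110"),
    ("600194", "01700001"),
]
membership = component_membership(edges)
for vertex, component in membership.items():
    print(vertex, component)
```

For this edge list, each pair ends up in its own component, which matches the grouping in the GraphFrames result. This approach only suits edge lists that fit in memory on a single machine.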

Other community detection algorithms (such as label propagation, LPA) are described in the GraphFrames user guide.



Source: https://stackoverflow.com/questions/56494223/using-pyspark-how-to-create-unidirected-graph-using-selected-pairs-from-edge-lis
