I have a dataset with more than 50k nodes, and I am trying to extract possible edges and communities from it. I did try using some graph tools like Gephi, Cytoscape, s
Use Pandas to get the data into a pairwise node listing, where each row represents an edge, based on your edge criteria. Then migrate into a networkx object for graph analysis.
The criteria for two nodes sharing an edge include a match on both gps1 AND gps2. You can adapt the groupby approach I've taken here if you want to apply additional temporal conditions on edges.
Since we want to manipulate data based on timestamps, convert start and end to datetime dtype:
df.start = pd.to_datetime(df.start, unit="s")
df.end = pd.to_datetime(df.end, unit="s")
df.start.describe()
count 35
unique 11
top 2004-01-05 00:00:13
freq 8
first 2004-01-05 00:00:01
last 2004-01-05 00:00:26
Name: start, dtype: object
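As a quick sanity check on the conversion, here is a minimal sketch. It assumes the raw start and end columns hold Unix epoch seconds (which is what unit="s" implies); the two values below are made up for illustration:

```python
import pandas as pd

# Hypothetical raw values, assumed to be Unix epoch seconds
raw = pd.Series([1073260801, 1073260803])

# unit="s" tells pandas to interpret the integers as seconds since 1970-01-01
converted = pd.to_datetime(raw, unit="s")

# Once the column is datetime dtype, differences come back as Timedelta objects
duration = converted.iloc[1] - converted.iloc[0]
print(converted)
print(duration)
```

With datetime dtype in place, time-based grouping and min/max aggregation work directly on the column.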
df.head()
ID start end gps1 gps2
0 0022d9064bc 2004-01-05 00:00:01 2004-01-05 00:00:03 819251 440006
1 00022d9064bc 2004-01-05 00:00:03 2004-01-05 00:00:10 819213 439954
2 00904b4557d3 2004-01-05 00:00:03 2004-01-05 00:18:40 817526 439458
3 00022de73863 2004-01-05 00:00:04 2004-01-05 01:16:50 817558 439525
4 00904b14b494 2004-01-05 00:00:04 2004-01-05 00:30:25 817558 439525
The sample observations happen within a few seconds of each other, so we'll set the grouping frequency to be only a few seconds:
near = "5s"
Now groupby location and start time to find connected nodes:
edges = (df.groupby(["gps1",
                     "gps2",
                     pd.Grouper(key="start",
                                freq=near,
                                closed="right",
                                label="right")],
                    as_index=False)
           .agg({"ID":','.join,
                 "start":"min",
                 "end":"max"})
           .reset_index()
           .rename(columns={"index":"edge",
                            "start":"start_min",
                            "end":"end_max"})
         )
edges.ID = edges.ID.str.split(",")
edges.head():
edge gps1 gps2 ID \
0 0 817526 439458 [00904b4557d3]
1 1 817558 439525 [00022de73863, 00904b14b494, 00904b14b494, 009...
2 2 817558 439525 [00022de73863, 00904b14b494, 00904b312d9e]
3 3 817721 439564 [00022d176cf3, 000c30d8d2e8, 00904b243bc4, 009...
4 4 817735 439757 [003065d2d8b6, 00904b0c7856]
start_min end_max
0 2004-01-05 00:00:03 2004-01-05 00:18:40
1 2004-01-05 00:00:04 2004-01-05 01:16:50
2 2004-01-05 00:00:25 2004-01-05 00:01:19
3 2004-01-05 00:00:13 2004-01-05 00:02:42
4 2004-01-05 00:00:17 2004-01-05 01:52:40
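To see what the location-plus-time grouping does in isolation, here is a small sketch on made-up data (the toy frame and its IDs are hypothetical, not taken from your dataset). It also shows one caveat of fixed-width bins: two start times can be close together and still fall on opposite sides of a bin boundary.

```python
import pandas as pd

# Hypothetical toy data: a, b, c share a location; d is somewhere else
toy = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "gps1": [1, 1, 1, 2],
    "gps2": [9, 9, 9, 9],
    "start": pd.to_datetime([1, 3, 6, 2], unit="s"),
})

# Same keys as above: location plus a 5-second bin on start time
groups = (toy.groupby(["gps1",
                       "gps2",
                       pd.Grouper(key="start",
                                  freq="5s",
                                  closed="right",
                                  label="right")])["ID"]
             .apply(list))
print(groups)
```

Here a and b land in the same bin and so share an edge, while c starts only 3 seconds after b but falls into the next bin. If that matters for your data, a wider freq or a rolling comparison of start times might be a better fit.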
Each row now represents a unique edge category. ID is a list of nodes that all share that edge. It's a bit tricky to get this list into a new structure of node pairs; I've resorted to some old-fashioned nested for-loops. There's likely some Pandas-fu that can improve efficiency here:
Note: In the case of a singleton node, I've assigned a None value to its pair. If you don't want to track singletons, just ignore the if not len(combos): ... logic.
from itertools import combinations
pairs = []
for e in edges.edge.values:
    # node list and shared attributes for this edge category
    nodes = edges.loc[edges.edge==e, "ID"].values[0]
    attrs = edges.loc[edges.edge==e, ["gps1","gps2","start_min","end_max"]]
    combos = list(combinations(nodes, 2))
    if not len(combos):
        # singleton node: pair it with None
        pair = [e, nodes[0], None]
        pair.extend(attrs.values[0])
        pairs.append(pair)
    else:
        # one row per unordered pair of nodes sharing this edge
        for combo in combos:
            pair = [e, combo[0], combo[1]]
            pair.extend(attrs.values[0])
            pairs.append(pair)
cols = ["edge","nodeA","nodeB","gps1","gps2","start_min","end_max"]
pairs_df = pd.DataFrame(pairs, columns=cols)
pairs_df.head():
edge nodeA nodeB gps1 gps2 start_min \
0 0 00904b4557d3 None 817526 439458 2004-01-05 00:00:03
1 1 00022de73863 00904b14b494 817558 439525 2004-01-05 00:00:04
2 1 00022de73863 00904b14b494 817558 439525 2004-01-05 00:00:04
3 1 00022de73863 00904b14b494 817558 439525 2004-01-05 00:00:04
4 1 00904b14b494 00904b14b494 817558 439525 2004-01-05 00:00:04
end_max
0 2004-01-05 00:18:40
1 2004-01-05 01:16:50
2 2004-01-05 01:16:50
3 2004-01-05 01:16:50
4 2004-01-05 01:16:50
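On the Pandas-fu point, one possible tightening of the pair-building step is sketched below. It iterates the rows once with itertuples instead of filtering the frame on every pass. The edges_demo frame and the node_pairs helper are hypothetical stand-ins, and note one deliberate difference from the loop above: I drop repeated IDs within an edge so a node never gets paired with itself.

```python
from itertools import combinations

import pandas as pd

# Hypothetical stand-in for the edges frame (timestamps omitted for brevity)
edges_demo = pd.DataFrame({
    "edge": [0, 1],
    "gps1": [817526, 817558],
    "gps2": [439458, 439525],
    "ID": [["n1"], ["n2", "n3", "n3", "n4"]],
})

def node_pairs(nodes):
    """Return all 2-combinations of the distinct nodes, or (node, None) for a singleton."""
    unique = sorted(set(nodes))  # assumption: repeated IDs should not create self-pairs
    return list(combinations(unique, 2)) or [(unique[0], None)]

# One pass over the rows; each row expands into one tuple per node pair
rows = [
    (row.edge, a, b, row.gps1, row.gps2)
    for row in edges_demo.itertuples(index=False)
    for a, b in node_pairs(row.ID)
]
pairs_demo = pd.DataFrame(rows, columns=["edge", "nodeA", "nodeB", "gps1", "gps2"])
print(pairs_demo)
```

If you do want to keep duplicate IDs (for example to count repeat visits), pass nodes straight to combinations as in the original loop.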
Now the data can be fit to a networkx object:
import networkx as nx
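# networkx 2.x renamed from_pandas_dataframe to from_pandas_edgelist
# if this raises on rows where nodeB is None, drop the singletons first: pairs_df.dropna(subset=["nodeB"])
g = nx.from_pandas_edgelist(pairs_df, "nodeA", "nodeB", edge_attr=True)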
# access edge attributes by node pairing:
test_A = "00022de73863"
test_B = "00904b14b494"
g[test_A][test_B]["start_min"]
# output:
Timestamp('2004-01-05 00:00:25')
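If you are tracking singletons, it may be cleaner to keep them out of the edge list and add them as isolated nodes instead, since newer networkx releases may refuse None as a node. A minimal sketch, assuming networkx 2.x or later (where the loader is called from_pandas_edgelist) and using a hypothetical pairs_demo frame in place of pairs_df:

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-in for pairs_df: one singleton row plus three real pairs
pairs_demo = pd.DataFrame({
    "nodeA": ["n1", "n2", "n2", "n3"],
    "nodeB": [None, "n3", "n4", "n4"],
    "gps1": [817526, 817558, 817558, 817558],
})

# Rows with a real partner become edges; the remaining columns become edge attributes
has_pair = pairs_demo["nodeB"].notna()
g_demo = nx.from_pandas_edgelist(pairs_demo[has_pair], "nodeA", "nodeB", edge_attr=True)

# Singletons are added as isolated nodes rather than as edges to None
g_demo.add_nodes_from(pairs_demo.loc[~has_pair, "nodeA"])

print(g_demo.number_of_nodes(), g_demo.number_of_edges())
```

Also keep in mind that a plain Graph stores one edge per node pair, so when the same pair appears in several rows, the attributes from the last row win. Use nx.MultiGraph via the create_using argument if you need to keep every occurrence.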
For community detection, there are several options. Consider the networkx community algorithms, as well as the community module, which builds off of native networkx functionality.
I read your question as mainly concerned with manipulating your data into a format suitable for network analysis. As this answer is lengthy enough already, I'll leave it to you to pursue community detection strategies - several methods can be used out-of-the-box with the modules I've linked to here.
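Just to give you a starting point, here is a minimal sketch of one of the built-in networkx options, greedy modularity maximization. The toy graph is made up for illustration, and this is only one of several algorithms you could try; I am not suggesting it is the best fit for your data.

```python
import networkx as nx
from networkx.algorithms import community

# Hypothetical toy graph: two triangles joined by a single bridge edge
g_demo = nx.Graph()
g_demo.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),   # first triangle
    ("x", "y"), ("y", "z"), ("x", "z"),   # second triangle
    ("c", "x"),                           # bridge between them
])

# Returns a list of node sets, one per detected community
parts = community.greedy_modularity_communities(g_demo)
for i, members in enumerate(parts):
    print(i, sorted(members))
```

On your real graph you would pass g instead of g_demo. Isolated nodes each end up in their own community, so you may want to filter singletons out beforehand.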