I have a dataset with more than 50k nodes, and I am trying to extract possible edges and communities from it. I did try using some graph tools like gephi, cytoscape, s
Use Pandas to get the data into a pairwise node listing, where each row represents an edge, based on your edge criteria. Then migrate into a networkx object for graph analysis.
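The criteria for two nodes sharing an edge include the same location (gps1 AND gps2) and, as the grouping code further down applies it, start times that fall within the same short time window. You can adjust the groupby approach I've taken here if you want to apply additional temporal conditions on edges.
Since we want to manipulate data based on timestamps, convert start and end to datetime dtype: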
import pandas as pd
# unit="s" treats the raw start and end values as epoch seconds
df.start = pd.to_datetime(df.start, unit="s")
df.end = pd.to_datetime(df.end, unit="s")
df.start.describe()
count 35
unique 11
top 2004-01-05 00:00:13
freq 8
first 2004-01-05 00:00:01
last 2004-01-05 00:00:26
Name: start, dtype: object
df.head()
ID start end gps1 gps2
0 0022d9064bc 2004-01-05 00:00:01 2004-01-05 00:00:03 819251 440006
1 00022d9064bc 2004-01-05 00:00:03 2004-01-05 00:00:10 819213 439954
2 00904b4557d3 2004-01-05 00:00:03 2004-01-05 00:18:40 817526 439458
3 00022de73863 2004-01-05 00:00:04 2004-01-05 01:16:50 817558 439525
4 00904b14b494 2004-01-05 00:00:04 2004-01-05 00:30:25 817558 439525
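As a quick sanity check on the unit="s" conversion, here's a minimal sketch. The epoch values and the raw and durations names are made up for illustration; they are not taken from your data.

```python
import pandas as pd

# Hypothetical epoch-second values, chosen to land on 2004-01-05
raw = pd.DataFrame({"start": [1073260801, 1073260803],
                    "end": [1073260803, 1073260810]})

# Same conversion as above: interpret the integers as seconds since the epoch
raw["start"] = pd.to_datetime(raw["start"], unit="s")
raw["end"] = pd.to_datetime(raw["end"], unit="s")

# Once both columns are datetimes, arithmetic on them yields timedeltas
durations = raw["end"] - raw["start"]
print(raw)
print(durations)
```

Having real datetimes is what makes the time-based grouping in the next step possible.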
The sample observations happen within a few seconds of each other, so we'll set the grouping frequency to be only a few seconds:
near = "5s"
Now groupby location and start time to find connected nodes:
edges = (df.groupby(["gps1",
                     "gps2",
                     pd.Grouper(key="start",
                                freq=near,
                                closed="right",
                                label="right")],
                    as_index=False)
           .agg({"ID": ','.join,
                 "start": "min",
                 "end": "max"})
           .reset_index()
           .rename(columns={"index": "edge",
                            "start": "start_min",
                            "end": "end_max"})
        )
edges.ID = edges.ID.str.split(",")
edges.head()
edge gps1 gps2 ID \
0 0 817526 439458 [00904b4557d3]
1 1 817558 439525 [00022de73863, 00904b14b494, 00904b14b494, 009...
2 2 817558 439525 [00022de73863, 00904b14b494, 00904b312d9e]
3 3 817721 439564 [00022d176cf3, 000c30d8d2e8, 00904b243bc4, 009...
4 4 817735 439757 [003065d2d8b6, 00904b0c7856]
start_min end_max
0 2004-01-05 00:00:03 2004-01-05 00:18:40
1 2004-01-05 00:00:04 2004-01-05 01:16:50
2 2004-01-05 00:00:25 2004-01-05 00:01:19
3 2004-01-05 00:00:13 2004-01-05 00:02:42
4 2004-01-05 00:00:17 2004-01-05 01:52:40
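To see what pd.Grouper is doing with closed="right" and label="right", here's a small sketch on made-up timestamps. The toy frame and the node names n1 to n4 are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical observations: four nodes with start times a few seconds apart
toy = pd.DataFrame({
    "ID": ["n1", "n2", "n3", "n4"],
    "start": pd.to_datetime(["2004-01-05 00:00:01", "2004-01-05 00:00:03",
                             "2004-01-05 00:00:05", "2004-01-05 00:00:06"]),
})

# 5-second windows that include their right edge and are labeled by it,
# i.e. (00:00:00, 00:00:05] and (00:00:05, 00:00:10]
grouper = pd.Grouper(key="start", freq="5s", closed="right", label="right")
bins = toy.groupby(grouper)["ID"].agg(list)
print(bins)
```

Note that these are fixed windows, not a sliding tolerance: n3 and n4 start only one second apart but land in different bins. If that matters for your edge definition, you'd need a different temporal condition.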
Each row now represents a unique edge category. ID is a list of the nodes that all share that edge. It's a bit tricky to get this list into a new structure of node pairs; I've resorted to some old-fashioned nested for-loops. There's likely some Pandas-fu that can improve efficiency here.
Note: In the case of a singleton node, I've assigned a None value to its pair. If you don't want to track singletons, just ignore the if not len(combos): ... logic.
from itertools import combinations
pairs = []
for e in edges.edge.values:
    # nodes that share this edge, plus the edge's attributes
    nodes = edges.loc[edges.edge==e, "ID"].values[0]
    attrs = edges.loc[edges.edge==e, ["gps1","gps2","start_min","end_max"]]
    combos = list(combinations(nodes, 2))
    if not len(combos):
        # singleton node: pair it with None
        pair = [e, nodes[0], None]
        pair.extend(attrs.values[0])
        pairs.append(pair)
    else:
        # one row per unordered pair of nodes
        for combo in combos:
            pair = [e, combo[0], combo[1]]
            pair.extend(attrs.values[0])
            pairs.append(pair)
cols = ["edge","nodeA","nodeB","gps1","gps2","start_min","end_max"]
pairs_df = pd.DataFrame(pairs, columns=cols)
pairs_df.head()
edge nodeA nodeB gps1 gps2 start_min \
0 0 00904b4557d3 None 817526 439458 2004-01-05 00:00:03
1 1 00022de73863 00904b14b494 817558 439525 2004-01-05 00:00:04
2 1 00022de73863 00904b14b494 817558 439525 2004-01-05 00:00:04
3 1 00022de73863 00904b14b494 817558 439525 2004-01-05 00:00:04
4 1 00904b14b494 00904b14b494 817558 439525 2004-01-05 00:00:04
end_max
0 2004-01-05 00:18:40
1 2004-01-05 01:16:50
2 2004-01-05 01:16:50
3 2004-01-05 01:16:50
4 2004-01-05 01:16:50
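One possible bit of Pandas-fu for the pairing step is apply plus explode (available in pandas 0.25 and later). This is only a sketch on a made-up edges frame; the node_pairs helper is a hypothetical name of mine, and I haven't benchmarked it against the loops on 50k nodes.

```python
from itertools import combinations

import pandas as pd

# Hypothetical stand-in for the edges frame: one singleton, one group of three
edges = pd.DataFrame({
    "edge": [0, 1],
    "gps1": [817526, 817558],
    "gps2": [439458, 439525],
    "ID": [["a"], ["b", "c", "d"]],
})

def node_pairs(nodes):
    """Return all unordered node pairs, or (node, None) for a singleton."""
    combos = list(combinations(nodes, 2))
    return combos if combos else [(nodes[0], None)]

# One list of pairs per edge, then one row per pair
pairs_df = (edges.assign(pair=edges["ID"].apply(node_pairs))
                 .explode("pair")
                 .reset_index(drop=True))

# Split each (nodeA, nodeB) tuple into its own columns
pairs_df["nodeA"] = [p[0] for p in pairs_df["pair"]]
pairs_df["nodeB"] = [p[1] for p in pairs_df["pair"]]
pairs_df = pairs_df.drop(columns=["ID", "pair"])
print(pairs_df)
```

The other edge attributes ride along automatically, because explode repeats every other column for each pair.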
Now the data can be fit to a networkx object:
import networkx as nx
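# newer networkx releases (2.1 onward, as far as I recall) call this from_pandas_edgelist;
# older ones used from_pandas_dataframe
# recent releases also reject None as a node, so singleton rows are dropped here
g = nx.from_pandas_edgelist(pairs_df.dropna(subset=["nodeB"]), "nodeA", "nodeB", edge_attr=True)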
# access edge attributes by node pairing:
test_A = "00022de73863"
test_B = "00904b14b494"
g[test_A][test_B]["start_min"]
# output:
Timestamp('2004-01-05 00:00:25')
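If you do want to keep singletons in the graph, one option is to add them as isolated nodes rather than as edges to None, since recent networkx releases (to my knowledge) raise an error when None is used as a node. A minimal sketch, assuming a networkx version that provides from_pandas_edgelist and using a made-up pairs_df:

```python
import networkx as nx
import pandas as pd

# Hypothetical pairs frame: "a" is a singleton, "b" pairs with "c" and "d"
pairs_df = pd.DataFrame({
    "nodeA": ["a", "b", "b"],
    "nodeB": [None, "c", "d"],
    "gps1": [817526, 817558, 817558],
    "gps2": [439458, 439525, 439525],
})

has_pair = pairs_df["nodeB"].notna()

# Real pairs become edges, carrying the location as edge attributes
g = nx.from_pandas_edgelist(pairs_df[has_pair], "nodeA", "nodeB",
                            edge_attr=["gps1", "gps2"])

# Singletons become isolated nodes (degree 0)
g.add_nodes_from(pairs_df.loc[~has_pair, "nodeA"])

print(sorted(g.nodes()), g.number_of_edges())
```

Isolated nodes stay visible to node-level queries but won't influence edge-based community detection.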
For community detection, there are several options. Consider the networkx community algorithms, as well as the community module, which builds off of native networkx functionality.
I read your question as mainly concerned with manipulating your data into a format suitable for network analysis. As this answer is lengthy enough already, I'll leave it to you to pursue community detection strategies - several methods can be used out-of-the-box with the modules I've linked to here.
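As one possible starting point, here's a minimal sketch using greedy_modularity_communities from networkx.algorithms.community. The toy graph (two triangles joined by a single edge) is made up for illustration; with your data you'd pass in g instead, and a different algorithm may well suit it better.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical toy graph: two tight triangles connected by one bridge edge
toy = nx.Graph()
toy.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"),
                    ("d", "e"), ("e", "f"), ("d", "f"),
                    ("c", "d")])

# Greedy modularity maximization returns a list of node sets
communities = greedy_modularity_communities(toy)
partition = {frozenset(c) for c in communities}
print(partition)
```

Each community comes back as a set of node IDs, so it's easy to map the result back onto your original dataframe.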