Question
I have a pandas DataFrame with information about people's locations over time. It has more than 300 million rows.
Here is a sample in which each Name is assigned a unique index with groupby, and the rows are sorted by "Name" and "Year":
import pandas as pd
inp = [{'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John','Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Orange county'}, {'Name': 'John', 'Year':2019, 'Address':'NewYork'}, {'Name': 'Steve', 'Year':2018, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2020, 'Address':'California'}, {'Name': 'Steve', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Beverly hills'}, {'Name': 'Steve', 'Year':2021, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'California'}, {'Name': 'Steve', 'Year':2018, 'Address':'NewYork'}, {'Name': 'Steve', 'Year':2018, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'NewYork'}]
df = pd.DataFrame(inp)
df['Name_Grouped_Index'] = df.groupby(['Name']).ngroup()
df = df.sort_values(['Name', 'Year'], ascending=[False, True])
print(df)
     Name  Year        Address  Name_Grouped_Index
5   Steve  2018         Canada                   1
15  Steve  2018        NewYork                   1
16  Steve  2018     California                   1
6   Steve  2019         Canada                   1
7   Steve  2019         Canada                   1
8   Steve  2020     California                   1
9   Steve  2020         Canada                   1
13  Steve  2021     California                   1
14  Steve  2022     California                   1
17  Steve  2022        NewYork                   1
0    John  2018  Beverly hills                   0
1    John  2018  Beverly hills                   0
2    John  2019  Beverly hills                   0
3    John  2019  Orange county                   0
4    John  2019        NewYork                   0
10   John  2020         Canada                   0
11   John  2021         Canada                   0
12   John  2021  Beverly hills                   0
Thanks to @MarcusRenshaw, I can now build the network graph matrix (adjacency matrix), which gives the total number of changes between addresses: for example, how many times people moved from "Canada" to "California". The solution for that can be found HERE.
Here is the NumPy array that I get as the "Network Matrix" from the solution above:
['Canada', 'NewYork', 'California', 'Beverly hills', 'Orange county']
[[2 1 2 1 0]
[1 0 1 0 0]
[2 1 1 0 0]
[0 0 0 2 1]
[0 1 0 0 0]]
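To make the matrix easier to read, here is a minimal sketch that turns it into a list of weighted, directed edges. It assumes that rows are origins and columns are destinations, so that matrix[i, j] counts moves from labels[i] to labels[j]; the question does not state this convention explicitly. The names labels, matrix, edges, incoming and self_loops are only illustrative.

```python
import numpy as np

labels = ['Canada', 'NewYork', 'California', 'Beverly hills', 'Orange county']
matrix = np.array([
    [2, 1, 2, 1, 0],
    [1, 0, 1, 0, 0],
    [2, 1, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 1, 0, 0, 0],
])

# Assumption: matrix[i, j] is the number of moves from labels[i] to labels[j].
rows, cols = np.nonzero(matrix)
edges = [(labels[i], labels[j], int(matrix[i, j])) for i, j in zip(rows, cols)]

# Total weight arriving at each node (column sums), self-moves included.
incoming = dict(zip(labels, matrix.sum(axis=0).tolist()))

# Pairs such as "Canada-Canada" sit on the diagonal.
self_loops = [(origin, weight) for origin, dest, weight in edges if origin == dest]

for origin, dest, weight in edges:
    print(f"{origin} -> {dest}: {weight}")
```

Under this reading, the incoming totals are what would drive node size, and the edge weights are what would drive edge thickness.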
What I want is to plot the Network Matrix NumPy array with the following characteristics:
- A directed graph with arrows showing the direction between nodes.
- A node can have an edge to itself, since I have pairs like "Canada-Canada" that are important to show.
- Node size represents the number of incoming edges (links): the more incoming links, the bigger the node.
- Edge (link) thickness represents the number of changes between two nodes (locations): a thicker edge means a higher volume of location changes between those nodes.
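One possible way to get a plot with these characteristics is sketched below using networkx and matplotlib. This is an illustration under assumptions, not a definitive solution: it again assumes matrix[i, j] counts moves from labels[i] to labels[j], the helper names (build_graph, incoming_node_sizes, draw_graph), the scaling constants and the output file name are all made up for the example, and how self-loops are rendered depends on the installed networkx version.

```python
import networkx as nx
import numpy as np

labels = ['Canada', 'NewYork', 'California', 'Beverly hills', 'Orange county']
matrix = np.array([
    [2, 1, 2, 1, 0],
    [1, 0, 1, 0, 0],
    [2, 1, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 1, 0, 0, 0],
])


def build_graph(matrix, labels):
    """Directed, weighted graph; assumes matrix[i, j] counts moves i -> j."""
    graph = nx.from_numpy_array(matrix, create_using=nx.DiGraph)
    return nx.relabel_nodes(graph, dict(enumerate(labels)))


def incoming_node_sizes(graph, base=300, step=300):
    """Node size grows with the summed weight of incoming edges."""
    incoming = dict(graph.in_degree(weight="weight"))
    return [base + step * incoming[node] for node in graph.nodes]


def draw_graph(graph, path="network.png"):
    # Imported here so that building the graph does not require matplotlib.
    import matplotlib.pyplot as plt

    pos = nx.circular_layout(graph)
    sizes = incoming_node_sizes(graph)
    # Edge thickness is proportional to the number of moves on that edge.
    widths = [1.5 * graph[u][v]["weight"] for u, v in graph.edges]

    fig, ax = plt.subplots(figsize=(7, 7))
    nx.draw_networkx_nodes(graph, pos, node_size=sizes,
                           node_color="lightblue", ax=ax)
    nx.draw_networkx_labels(graph, pos, font_size=9, ax=ax)
    # Curved edges keep A -> B and B -> A from being drawn on top of each other.
    nx.draw_networkx_edges(graph, pos, width=widths, arrows=True,
                           arrowsize=20, node_size=sizes,
                           connectionstyle="arc3,rad=0.15", ax=ax)
    ax.set_axis_off()
    fig.savefig(path, dpi=150)
    plt.close(fig)


graph = build_graph(matrix, labels)
```

Calling draw_graph(graph) writes the figure to network.png. The node sizes use the weighted in-degree (the sum of incoming edge weights); if node size should instead reflect the count of distinct incoming edges, graph.in_degree() without the weight argument could be used.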
Source: https://stackoverflow.com/questions/61325124/plotting-the-graph-in-networkx-from-the-numpy-array