Question
I have a pandas DataFrame with information about people's locations over time. It has 300+ million rows.
Here is a sample where each Name is assigned a unique index with groupby and the rows are sorted by Name and Year:
import pandas as pd
inp = [{'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Orange county'}, {'Name': 'John', 'Year':2019, 'Address':'New York'}, {'Name': 'Steve', 'Year':2018, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2020, 'Address':'California'}, {'Name': 'Steve', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Beverly hills'}, {'Name': 'Steve', 'Year':2021, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'California'}, {'Name': 'Steve', 'Year':2018, 'Address':'NewYork'}, {'Name': 'Steve', 'Year':2018, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'NewYork'}]
df = pd.DataFrame(inp)
df['Author_Grouped_Index'] = df.groupby(['Name']).ngroup()
df.sort_values(['Name', 'Year'], ascending=[False, True])
Output:
+-------+-------+------+---------------+----------------------+
| Index | Name | Year | Address | Author_Grouped_Index |
+-------+-------+------+---------------+----------------------+
| 5 | Steve | 2018 | Canada | 1 |
+-------+-------+------+---------------+----------------------+
| 15 | Steve | 2018 | NewYork | 1 |
+-------+-------+------+---------------+----------------------+
| 16 | Steve | 2018 | California | 1 |
+-------+-------+------+---------------+----------------------+
| 6 | Steve | 2019 | Canada | 1 |
+-------+-------+------+---------------+----------------------+
| 7 | Steve | 2019 | Canada | 1 |
+-------+-------+------+---------------+----------------------+
| 8 | Steve | 2020 | California | 1 |
+-------+-------+------+---------------+----------------------+
| 9 | Steve | 2020 | Canada | 1 |
+-------+-------+------+---------------+----------------------+
| 13 | Steve | 2021 | California | 1 |
+-------+-------+------+---------------+----------------------+
| 14 | Steve | 2022 | California | 1 |
+-------+-------+------+---------------+----------------------+
| 17 | Steve | 2022 | NewYork | 1 |
+-------+-------+------+---------------+----------------------+
| 0 | John | 2018 | Beverly hills | 0 |
+-------+-------+------+---------------+----------------------+
| 1 | John | 2018 | Beverly hills | 0 |
+-------+-------+------+---------------+----------------------+
| 2 | John | 2019 | Beverly hills | 0 |
+-------+-------+------+---------------+----------------------+
| 3 | John | 2019 | Orange county | 0 |
+-------+-------+------+---------------+----------------------+
| 4 | John | 2019 | New York | 0 |
+-------+-------+------+---------------+----------------------+
| 10 | John | 2020 | Canada | 0 |
+-------+-------+------+---------------+----------------------+
| 11 | John | 2021 | Canada | 0 |
+-------+-------+------+---------------+----------------------+
| 12 | John | 2021 | Beverly hills | 0 |
+-------+-------+------+---------------+----------------------+
I want to get the network graph matrix (adjacency matrix) that shows the total number of changes between addresses. In other words, for example, how many times people moved from “Canada” to “California” in 2018.
Ideal Outputs:
1) A directed graph built from the Address column. Technically, this means converting the Address column into two columns, "Source" and "Target", where the "Target" value is the "Source" for the next row. Preferably, the pairs would be counted in another column, "Weight", instead of being repeated.
+------------+------------+------+--------+
| Source | Target | Year | Weight |
+------------+------------+------+--------+
| Canada | NewYork | 2018 | |
+------------+------------+------+--------+
| NewYork | California | 2018 | |
+------------+------------+------+--------+
| California | Canada | 2019 | |
+------------+------------+------+--------+
| Canada | Canada | 2019 | |
+------------+------------+------+--------+
| Canada | California | 2020 | |
+------------+------------+------+--------+
| California | Canada | 2020 | |
+------------+------------+------+--------+
| Canada | California | 2021 | |
+------------+------------+------+--------+
| California | California | 2022 | |
+------------+------------+------+--------+
| California | NewYork | 2022 | |
+------------+------------+------+--------+
OR
2) A matrix to illustrate the total changes between addresses.
+---------------+--------+---------+------------+---------------+---------------+
| From \ To | Canada | NewYork | California | Beverly hills | Orange county |
+---------------+--------+---------+------------+---------------+---------------+
| Canada | 2 | 2 | 2 | 2 | 0 |
+---------------+--------+---------+------------+---------------+---------------+
| NewYork | 1 | 0 | 1 | 0 | 0 |
+---------------+--------+---------+------------+---------------+---------------+
| California | 2 | 1 | 1 | 0 | 0 |
+---------------+--------+---------+------------+---------------+---------------+
| Beverly hills | 0 | 0 | 0 | 2 | 1 |
+---------------+--------+---------+------------+---------------+---------------+
| Orange county | 0 | 1 | 0 | 0 | 0 |
+---------------+--------+---------+------------+---------------+---------------+
Answer 1:
This is not the prettiest code, but at least you can follow each step. I've gone for the second option because you can easily build your graph from this connections matrix. Do you need help with making the networkx graph? The matrix has one row and one column per distinct address: 'Beverly hills', 'Orange county', 'New York', 'Canada', 'California', 'NewYork'. The code prints the exact row and column order as places. You've spelled New York differently for each person ('New York' for John, 'NewYork' for Steve), so it comes up twice.
import numpy as np
import pandas as pd

inp = [{'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Orange county'}, {'Name': 'John', 'Year':2019, 'Address':'New York'}, {'Name': 'Steve', 'Year':2018, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2020, 'Address':'California'}, {'Name': 'Steve', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Beverly hills'}, {'Name': 'Steve', 'Year':2021, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'California'}, {'Name': 'Steve', 'Year':2018, 'Address':'NewYork'}, {'Name': 'Steve', 'Year':2018, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'NewYork'}]
df = pd.DataFrame(inp)
df['Author_Grouped_Index'] = df.groupby(['Name']).ngroup()
# sort_values returns a new DataFrame, so keep the result;
# otherwise the loop below walks the rows in their original, unsorted order
df = df.sort_values(['Name', 'Year'], ascending=[False, True])
print(df)

dictionary_ = {}  # addresses of each person, in sorted row order
places = []  # all of the places, in order of first appearance
for index, row in df.iterrows():
    # setdefault creates the list the first time a person is seen
    dictionary_.setdefault(row['Author_Grouped_Index'], []).append(row['Address'])
    if row['Address'] not in places:
        places.append(row['Address'])
print(dictionary_)

new_dictionary = {}  # number of times each (from, to) move occurs
for key, value in dictionary_.items():
    for x in range(len(value) - 1):
        # tuple keys stay unambiguous even if an address contains "-"
        move = (value[x], value[x + 1])
        new_dictionary[move] = new_dictionary.get(move, 0) + 1
print(new_dictionary)
print(places)

array = np.zeros((len(places), len(places)), dtype=int)
for x, place in enumerate(places):
    for y, place_2 in enumerate(places):
        # rows are the origin, columns the destination; unseen pairs stay 0
        array[x, y] = new_dictionary.get((place, place_2), 0)
print(array)
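With 300+ million rows, the row-by-row loop above will probably be too slow, so here is a rough vectorized sketch of the same idea that also produces the "Source"/"Target"/"Year"/"Weight" table from option 1. The function name build_transitions and the small sample frame are made up for illustration, and I'm assuming the Year of a move is the year of the row being moved to, as in the table in the question. Treat it as a starting point rather than a tuned solution; at that size you may still need to process the data in chunks.

```python
import pandas as pd


def build_transitions(df):
    """Return (edges, matrix) for consecutive address changes per person.

    Assumes df has 'Name', 'Year' and 'Address' columns, as in the question.
    """
    # Stable sort by Year: rows with the same Year keep their original order.
    ordered = df.sort_values('Year', kind='mergesort')
    grouped = ordered.groupby('Name')

    # shift(-1) within each person pairs every row with that person's next row.
    moves = pd.DataFrame({
        'Source': ordered['Address'],
        'Target': grouped['Address'].shift(-1),
        'Year': grouped['Year'].shift(-1),
    })
    # The last row of each person has no next address, so drop it.
    moves = moves.dropna(subset=['Target']).astype({'Year': int})

    # Option 1: one row per (Source, Target, Year) with a count.
    edges = (moves.groupby(['Source', 'Target', 'Year'])
                  .size()
                  .reset_index(name='Weight'))

    # Option 2: square From/To matrix over every address that occurs.
    places = sorted(set(moves['Source']) | set(moves['Target']))
    matrix = (pd.crosstab(moves['Source'], moves['Target'])
                .reindex(index=places, columns=places, fill_value=0))
    return edges, matrix


# Tiny made-up sample, just to show the call.
sample = pd.DataFrame({
    'Name': ['Steve', 'Steve', 'Steve', 'John', 'John'],
    'Year': [2019, 2018, 2020, 2018, 2019],
    'Address': ['Canada', 'Canada', 'California', 'Canada', 'California'],
})
edges, matrix = build_transitions(sample)
print(edges)
print(matrix)
```

If networkx is installed, nx.from_pandas_edgelist(edges, 'Source', 'Target', edge_attr='Weight', create_using=nx.DiGraph) should turn the edge list into a directed graph. Note that a plain DiGraph keeps only one edge per Source/Target pair, so either sum the weights over Year first or use nx.MultiDiGraph.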
Source: https://stackoverflow.com/questions/61307877/transforming-pandas-dataframe-column-to-networkx-graph-with-source-and-target