Count row value change for each group in pandas DataFrame

Posted by 痴心易碎 on 2020-05-08 18:16:10

Question


I have a pandas DataFrame with information about people's locations over time. It has more than 300 million rows.

Sample:

import pandas as pd
inp = [{'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Orange county'}, {'Name': 'John', 'Year':2019, 'Address':'New York'}, {'Name': 'Steve', 'Year':2018, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2020, 'Address':'California'}, {'Name': 'Steve', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Beverly hills'}, {'Name': 'Steve', 'Year':2021, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'California'}, {'Name': 'Steve', 'Year':2018, 'Address':'NewYork'}, {'Name': 'Steve', 'Year':2018, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'NewYork'}]
df = pd.DataFrame(inp)
print(df)

Output:

          Address   Name  Year
0   Beverly hills   John  2018
1   Beverly hills   John  2018
2   Beverly hills   John  2019
3   Orange county   John  2019
4        New York   John  2019
5          Canada  Steve  2018
6          Canada  Steve  2019
7          Canada  Steve  2019
8      California  Steve  2020
9          Canada  Steve  2020
10         Canada   John  2020
11         Canada   John  2021
12  Beverly hills   John  2021
13     California  Steve  2021
14     California  Steve  2022
15        NewYork  Steve  2018
16     California  Steve  2018
17        NewYork  Steve  2022

I want to count the changes between addresses in a specific year. In other words, how many times did people move from “Canada” to “California” in 2018?

Ideal Outputs:

1) Matrix as below for each year. Example: all address changes in the year 2019 (including 2018 to 2019).

+---------------+---------------+---------------+----------+------------+
| From\ To      | Beverly hills | Orange county | New York | California |
+---------------+---------------+---------------+----------+------------+
| Beverly hills | 0             | 1             | 0        | 0          |
+---------------+---------------+---------------+----------+------------+
| Orange county | 0             | 0             | 1        | 0          |
+---------------+---------------+---------------+----------+------------+
| New York      | 0             | 2             | 0        | 0          |
+---------------+---------------+---------------+----------+------------+
| California    | 0             | 0             | 0        | 0          |
+---------------+---------------+---------------+----------+------------+

2) Address changes for all years.

+---------------+---------------+------+------+------+
| Address 1     | Address 2     | 2018 | 2019 | 2020 |
+---------------+---------------+------+------+------+
| Beverly hills | Orange county | 0    | 1    | 0    |
+---------------+---------------+------+------+------+
| New York      | Canada        | 0    | 0    | 1    |
+---------------+---------------+------+------+------+
| Canada        | New York      | 1    | 0    | 0    |
+---------------+---------------+------+------+------+
| California    | Canada        | 0    | 1    | 2    |
+---------------+---------------+------+------+------+

My solution so far: thanks to @QuangHoang, I can capture changes in “Year” and changes in “Address” with the following code:

groups = df.groupby('Name')

for col in ['Year', 'Address']:
    df[f'cng-{col}'] = groups[col].shift().fillna(df[col]).ne(df[col]).astype(int)

groups[col].shift() shifts the corresponding column by 1 within each name. fillna(df[col]) fills the first row in each (shifted) group with the original value, indicating no change. Finally, ne(df[col]) compares the shifted values with the original values to flag changes.
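To make the three steps easier to follow, here is a minimal sketch that keeps each intermediate result in its own variable. The toy DataFrame and the names prev, filled, changed and steps are hypothetical and chosen only for illustration; they are not part of the original code.

```python
import pandas as pd

# Hypothetical toy data, much smaller than the sample in the question.
toy = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B"],
    "Address": ["X", "X", "Y", "Z", "X"],
})

# Step 1: previous address within each name (missing on the first row of each group).
prev = toy.groupby("Name")["Address"].shift()

# Step 2: fill the missing first rows with the row's own value, so they count as "no change".
filled = prev.fillna(toy["Address"])

# Step 3: flag rows whose previous address differs from the current one.
changed = filled.ne(toy["Address"]).astype(int)

# Side-by-side view of every step.
steps = pd.DataFrame({
    "Name": toy["Name"],
    "Address": toy["Address"],
    "shifted": prev,
    "filled": filled,
    "changed": changed,
})
print(steps)
```

Note that shift() works on the existing row order, so the flags depend on how the rows are ordered within each name.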

Yields:

+----+---------------+-------+------+----------+-------------+
| ID | Address       | Name  | Year | cng-Year | cng-Address |
+----+---------------+-------+------+----------+-------------+
| 0  | Beverly hills | John  | 2018 | 0        | 0           |
+----+---------------+-------+------+----------+-------------+
| 1  | Beverly hills | John  | 2018 | 0        | 0           |
+----+---------------+-------+------+----------+-------------+
| 2  | Beverly hills | John  | 2019 | 1        | 0           |
+----+---------------+-------+------+----------+-------------+
| 3  | Orange county | John  | 2019 | 0        | 1           |
+----+---------------+-------+------+----------+-------------+
| 4  | New York      | John  | 2019 | 0        | 1           |
+----+---------------+-------+------+----------+-------------+
| 10 | Canada        | John  | 2020 | 1        | 1           |
+----+---------------+-------+------+----------+-------------+
| 11 | Canada        | John  | 2021 | 1        | 0           |
+----+---------------+-------+------+----------+-------------+
| 12 | Beverly hills | John  | 2021 | 0        | 1           |
+----+---------------+-------+------+----------+-------------+
| 5  | Canada        | Steve | 2018 | 0        | 0           |
+----+---------------+-------+------+----------+-------------+
| 15 | NewYork       | Steve | 2018 | 1        | 1           |
+----+---------------+-------+------+----------+-------------+
| 16 | California    | Steve | 2018 | 0        | 1           |
+----+---------------+-------+------+----------+-------------+
| 6  | Canada        | Steve | 2019 | 1        | 0           |
+----+---------------+-------+------+----------+-------------+
| 7  | Canada        | Steve | 2019 | 0        | 0           |
+----+---------------+-------+------+----------+-------------+
| 8  | California    | Steve | 2020 | 1        | 1           |
+----+---------------+-------+------+----------+-------------+
| 9  | Canada        | Steve | 2020 | 0        | 1           |
+----+---------------+-------+------+----------+-------------+
| 13 | California    | Steve | 2021 | 1        | 1           |
+----+---------------+-------+------+----------+-------------+
| 14 | California    | Steve | 2022 | 1        | 0           |
+----+---------------+-------+------+----------+-------------+
| 17 | NewYork       | Steve | 2022 | 1        | 1           |
+----+---------------+-------+------+----------+-------------+

Answer 1:


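If I understood the problem correctly, this counts the distinct addresses recorded for each name and year when applied to the original df: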

df.drop_duplicates().groupby(['Name','Year']).size().reset_index(name="changes")

This gives the following output:

    Name  Year  changes
0   John  2018        1
1   John  2019        3
2   John  2020        1
3   John  2021        2
4  Steve  2018        3
5  Steve  2019        1
6  Steve  2020        2
7  Steve  2021        1
8  Steve  2022        2
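The result above counts addresses per name and year rather than From/To pairs. One possible way to get closer to the two ideal outputs in the question is to pair each address with the previous address of the same person and then tabulate the pairs. The following is only a sketch under stated assumptions: the shortened sample data and the names moves, matrix_2019 and by_year are hypothetical, rows are assumed to already be in chronological order for each person, and a move is attributed to the year of the destination row.

```python
import pandas as pd

# Hypothetical shortened sample in the same shape as the question's data.
df = pd.DataFrame({
    "Name": ["John", "John", "John", "Steve", "Steve", "Steve", "Steve"],
    "Year": [2018, 2019, 2019, 2018, 2019, 2019, 2020],
    "Address": ["Beverly hills", "Beverly hills", "Orange county",
                "Canada", "California", "Canada", "California"],
})

# Previous address of the same person, based on the existing row order (assumption).
moves = df.assign(From=df.groupby("Name")["Address"].shift())

# Keep only real moves: a previous address exists and differs from the current one.
moves = moves[moves["From"].notna() & moves["From"].ne(moves["Address"])]
moves = moves.rename(columns={"Address": "To"})

# Output 1: From/To matrix for a single year (moves are dated by the destination row).
in_2019 = moves[moves["Year"] == 2019]
matrix_2019 = pd.crosstab(in_2019["From"], in_2019["To"])

# Output 2: one row per (From, To) pair, one column per year.
by_year = moves.groupby(["From", "To", "Year"]).size().unstack("Year", fill_value=0)

print(matrix_2019)
print(by_year)
```

Address pairs that never occur are simply absent from both tables; reindexing against the full list of addresses would be needed to show them as explicit zeros.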


Source: https://stackoverflow.com/questions/61216389/count-row-value-change-for-each-group-in-pandas-dataframe
