Pandas - Count frequency of value for last x amount of days

Question

I'm getting unexpected results. I want to create a column that, for each row, looks at the ID and the date and counts how many times that ID appeared in the preceding 7 days. (Eventually I'd like the window to be dynamic, for any x days, but I'm starting with 7.)

So given this dataframe:

import pandas as pd



df = pd.DataFrame(
        [['A', '2020-02-02 20:31:00'],
        ['A', '2020-02-03 00:52:00'],
        ['A', '2020-02-07 23:45:00'],
        ['A', '2020-02-08 13:19:00'],
        ['A', '2020-02-18 13:16:00'],
        ['A', '2020-02-27 12:16:00'],
        ['A', '2020-02-28 12:16:00'],
        ['B', '2020-02-07 18:57:00'],
        ['B', '2020-02-07 21:50:00'],
        ['B', '2020-02-12 19:03:00'],
        ['C', '2020-02-01 13:50:00'],
        ['C', '2020-02-11 15:50:00'],
        ['C', '2020-02-21 10:50:00']],
        columns = ['ID', 'Date'])

Code to calculate the occurrences in the last 7 days for each row:

df['Date'] = pd.to_datetime(df['Date'])

delta = 7
df['count_in_last_%s_days' %(delta)] = df.groupby(['ID', pd.Grouper(freq='%sD' %delta, key='Date')]).cumcount()

Output:

   ID                Date  count_in_last_7_days
0   A 2020-02-02 20:31:00                     0
1   A 2020-02-03 00:52:00                     1
2   A 2020-02-07 23:45:00                     2
3   A 2020-02-08 13:19:00                     0 #<---- This should output 3
4   A 2020-02-18 13:16:00                     0
5   A 2020-02-27 12:16:00                     0
6   A 2020-02-28 12:16:00                     1
7   B 2020-02-07 18:57:00                     0
8   B 2020-02-07 21:50:00                     1
9   B 2020-02-12 19:03:00                     0 #<---- This should output 2
10  C 2020-02-01 13:50:00                     0
11  C 2020-02-11 15:50:00                     0
12  C 2020-02-21 10:50:00                     0

Answer 1:

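A rolling window on Date, applied per ID with a window of delta days, should do it. Each row is given a 1, the 1s are summed over the window, and 1 is subtracted so that the current row does not count itself: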

(df.set_index('Date')
   .assign(count_last=1)
   .groupby('ID')
   .rolling(f'{delta}D')
   .sum() - 1
)

Output:

                        count_last
ID Date                           
A  2020-02-02 20:31:00         0.0
   2020-02-03 00:52:00         1.0
   2020-02-07 23:45:00         2.0
   2020-02-08 13:19:00         3.0
   2020-02-18 13:16:00         0.0
   2020-02-27 12:16:00         0.0
   2020-02-28 12:16:00         1.0
B  2020-02-07 18:57:00         0.0
   2020-02-07 21:50:00         1.0
   2020-02-12 19:03:00         2.0
C  2020-02-01 13:50:00         0.0
   2020-02-11 15:50:00         0.0
   2020-02-21 10:50:00         0.0

Answer 2:

You don't want a Grouper on Date here; you want a rolling window. A Grouper splits the dataframe into separate, consecutive, non-overlapping blocks of the required duration. Since you want the 7 days leading up to each date, this is a job for rolling:

delta = 7
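df['count_in_last_%s_days' % delta] = (
    df.assign(count=1)
      .groupby('ID')
      .rolling('%sD' % delta, on='Date')['count'].sum()  # roll only the numeric helper column
      .reset_index(level=0, drop=True)  # drop the ID level so the index lines up with df
      .astype(int) - 1)                 # subtract 1 to exclude the row itself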

This gives the expected result:

   ID                Date  count_in_last_7_days
0   A 2020-02-02 20:31:00                     0
1   A 2020-02-03 00:52:00                     1
2   A 2020-02-07 23:45:00                     2
3   A 2020-02-08 13:19:00                     3
4   A 2020-02-18 13:16:00                     0
5   A 2020-02-27 12:16:00                     0
6   A 2020-02-28 12:16:00                     1
7   B 2020-02-07 18:57:00                     0
8   B 2020-02-07 21:50:00                     1
9   B 2020-02-12 19:03:00                     2
10  C 2020-02-01 13:50:00                     0
11  C 2020-02-11 15:50:00                     0
12  C 2020-02-21 10:50:00                     0
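
The difference between the fixed blocks of a Grouper and the trailing window of rolling can be checked directly on the sample data. The sketch below is only an illustration: it prints the 7-day bins that the Grouper produces, then recomputes the trailing count by brute force, assuming the window for each row is the interval from 7 days before its Date (exclusive) up to that Date (inclusive). The helper name `earlier_in_window` is made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [['A', '2020-02-02 20:31:00'],
     ['A', '2020-02-03 00:52:00'],
     ['A', '2020-02-07 23:45:00'],
     ['A', '2020-02-08 13:19:00'],
     ['A', '2020-02-18 13:16:00'],
     ['A', '2020-02-27 12:16:00'],
     ['A', '2020-02-28 12:16:00'],
     ['B', '2020-02-07 18:57:00'],
     ['B', '2020-02-07 21:50:00'],
     ['B', '2020-02-12 19:03:00'],
     ['C', '2020-02-01 13:50:00'],
     ['C', '2020-02-11 15:50:00'],
     ['C', '2020-02-21 10:50:00']],
    columns=['ID', 'Date'])
df['Date'] = pd.to_datetime(df['Date'])
delta = 7

# Fixed, non-overlapping bins: every row belongs to exactly one block,
# so two rows less than a day apart can land in different blocks.
bins = df.groupby(pd.Grouper(key='Date', freq=f'{delta}D'))['ID'].count()
print(bins)

# Trailing window, recomputed by brute force for each row.
window = pd.Timedelta(days=delta)


def earlier_in_window(row):
    """Count other rows with the same ID inside (Date - delta, Date]."""
    same_id = df['ID'] == row['ID']
    in_window = (df['Date'] > row['Date'] - window) & (df['Date'] <= row['Date'])
    return int((same_id & in_window).sum()) - 1  # exclude the row itself


brute = df.apply(earlier_in_window, axis=1)
print(brute.tolist())
```

The brute-force version is quadratic in the number of rows, so it is useful as a cross-check on small data rather than as a replacement for rolling.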

Source: https://stackoverflow.com/questions/60617509/pandas-count-frequency-of-value-for-last-x-amount-of-days

Tags

python

pandas

datetime

pandas-groupby

rolling-computation