Moving average on pandas.groupby object that respects time

问题

Given a pandas dataframe in the following format:

toy = pd.DataFrame({
'id': [1,2,3,
       1,2,3,
       1,2,3],
'date': ['2015-05-13', '2015-05-13', '2015-05-13', 
         '2016-02-12', '2016-02-12', '2016-02-12', 
         '2018-07-23', '2018-07-23', '2018-07-23'],
'my_metric': [395, 634, 165, 
              144, 305, 293, 
              23, 395, 242]
})
# Make sure 'date' has datetime format
toy.date = pd.to_datetime(toy.date)

The my_metric column contains some (random) metric I wish to compute a time-dependent moving average of, conditional on the column id and within some specified time interval that I specify myself. I will refer to this time interval as the "lookback time"; which could be 5 minutes or 2 years. To determine which observations that are to be included in the lookback calculation, we use the date column (which could be the index if you prefer).

To my frustration, I have discovered that such a procedure is not easily performed using pandas builtins, since I need to perform the calculation conditionally on id and at the same time the calculation should only be made on observations within the lookback time (checked using the date column). Hence, the output dataframe should consist of one row for each id-date combination, with the my_metric column now being the average of all observations that is contatined within the lookback time (e.g. 2 years, including today's date).

For clarity, I have included a figure with the desired output format (apologies for the oversized figure) when using a 2-year lookback time:

I have a solution but it does not make use of specific pandas built-in functions and is likely sub-optimal (combination of list comprehension and a single for-loop). The solution I am looking for will not make use of a for-loop, and is thus more scalable/efficient/fast.

Thank you!

回答1:

Calculating lookback time: (Current_year - 2 years)

from dateutil.relativedelta import relativedelta
from dateutil import parser
import datetime

In [1691]: dt = '2018-01-01'

In [1695]: dt = parser.parse(dt)

In [1696]: lookback_time = dt - relativedelta(years=2)

Now, filter the dataframe on lookback time and calculate rolling average

In [1722]: toy['new_metric'] = ((toy.my_metric + toy[toy.date > lookback_time].groupby('id')['my_metric'].shift(1))/2).fillna(toy.my_metric)

In [1674]: toy.sort_values('id')
Out[1674]: 
        date  id  my_metric  new_metric
0 2015-05-13   1        395       395.0
3 2016-02-12   1        144       144.0
6 2018-07-23   1         23        83.5
1 2015-05-13   2        634       634.0
4 2016-02-12   2        305       305.0
7 2018-07-23   2        395       350.0
2 2015-05-13   3        165       165.0
5 2016-02-12   3        293       293.0
8 2018-07-23   3        242       267.5

回答2:

So, after some tinkering I found an answer that will generalize adequately. I used a slightly different 'toy' dataframe (slightly more relevant to my case). For completeness sake, here is the data:

Consider now the following code:

# Define a custom function which groups by time (using the index)
def rolling_average(x, dt):
    xt = x.sort_index().groupby(lambda x: x.time()).rolling(window=dt).mean()
    xt.index = xt.index.droplevel(0)
    return xt

dt='730D' # rolling average window: 730 days = 2 years

# Group by the 'id' column
g = toy.groupby('id')

# Apply the custom function
df = g.apply(rolling_average, dt=dt)

# Massage the data to appropriate format
df.index = df.index.droplevel(0)
df = df.reset_index().drop_duplicates(keep='last', subset=['id', 'date'])

The result is as expected:

来源：https://stackoverflow.com/questions/53609003/moving-average-on-pandas-groupby-object-that-respects-time

标签

python

pandas

moving-average