问题
Given a pandas dataframe in the following format:
toy = pd.DataFrame({
'id': [1,2,3,
1,2,3,
1,2,3],
'date': ['2015-05-13', '2015-05-13', '2015-05-13',
'2016-02-12', '2016-02-12', '2016-02-12',
'2018-07-23', '2018-07-23', '2018-07-23'],
'my_metric': [395, 634, 165,
144, 305, 293,
23, 395, 242]
})
# Make sure 'date' has datetime format
toy.date = pd.to_datetime(toy.date)
The my_metric
column contains some (random) metric I wish to compute a time-dependent moving average of, conditional on the column id
and within some specified time interval that I specify myself. I will refer to this time interval as the "lookback time"; which could be 5 minutes
or 2 years. To determine which observations that are to be included in the lookback calculation, we use the date
column (which could be the index if you prefer).
To my frustration, I have discovered that such a procedure is not easily performed using pandas builtins, since I need to perform the calculation conditionally
on id
and at the same time the calculation should only be made on observations within the lookback time (checked using the date
column). Hence, the output dataframe should consist of one row for each id
-date
combination, with the my_metric
column now being the average of all observations that is contatined within the lookback time (e.g. 2 years, including today's date).
For clarity, I have included a figure with the desired output format (apologies for the oversized figure) when using a 2-year lookback time:
I have a solution but it does not make use of specific pandas built-in functions and is likely sub-optimal (combination of list comprehension and a single for-loop). The solution I am looking for will not make use of a for-loop, and is thus more scalable/efficient/fast.
Thank you!
回答1:
Calculating lookback time: (Current_year - 2 years)
from dateutil.relativedelta import relativedelta
from dateutil import parser
import datetime
In [1691]: dt = '2018-01-01'
In [1695]: dt = parser.parse(dt)
In [1696]: lookback_time = dt - relativedelta(years=2)
Now, filter the dataframe on lookback time and calculate rolling average
In [1722]: toy['new_metric'] = ((toy.my_metric + toy[toy.date > lookback_time].groupby('id')['my_metric'].shift(1))/2).fillna(toy.my_metric)
In [1674]: toy.sort_values('id')
Out[1674]:
date id my_metric new_metric
0 2015-05-13 1 395 395.0
3 2016-02-12 1 144 144.0
6 2018-07-23 1 23 83.5
1 2015-05-13 2 634 634.0
4 2016-02-12 2 305 305.0
7 2018-07-23 2 395 350.0
2 2015-05-13 3 165 165.0
5 2016-02-12 3 293 293.0
8 2018-07-23 3 242 267.5
回答2:
So, after some tinkering I found an answer that will generalize adequately. I used a slightly different 'toy' dataframe (slightly more relevant to my case). For completeness sake, here is the data:
Consider now the following code:
# Define a custom function which groups by time (using the index)
def rolling_average(x, dt):
xt = x.sort_index().groupby(lambda x: x.time()).rolling(window=dt).mean()
xt.index = xt.index.droplevel(0)
return xt
dt='730D' # rolling average window: 730 days = 2 years
# Group by the 'id' column
g = toy.groupby('id')
# Apply the custom function
df = g.apply(rolling_average, dt=dt)
# Massage the data to appropriate format
df.index = df.index.droplevel(0)
df = df.reset_index().drop_duplicates(keep='last', subset=['id', 'date'])
The result is as expected:
来源:https://stackoverflow.com/questions/53609003/moving-average-on-pandas-groupby-object-that-respects-time