Question
I want to compute rolling sums group-wise for a large number of groups, and I'm having trouble doing it acceptably quickly.
Pandas has built-in methods for rolling and expanding calculations.
Here's an example:
import pandas as pd
import numpy as np
obs_per_g = 20
g = 10000
obs = g * obs_per_g
k = 20
df = pd.DataFrame(
data=np.random.normal(size=obs * k).reshape(obs, k),
index=pd.MultiIndex.from_product(iterables=[range(g), range(obs_per_g)]),
)
To get rolling and expanding sums I can use
df.groupby(level=0).expanding().sum()
df.groupby(level=0).rolling(window=5).sum()
But this takes a long time for a very large number of groups. For expanding sums, using the pandas method cumsum instead is almost 60 times faster (280 ms versus 16 s for the example above) and turns hours into minutes.
df.groupby(level=0).cumsum()
Is there a fast implementation of rolling sum in pandas, like cumsum is for expanding sums? If not, could I use numpy to accomplish this?
Answer 1:
I had the same experience with .rolling(): it's nice, but only for small datasets or when the function you're applying is non-standard. For sum(), I would suggest using cumsum() and subtracting a shifted cumsum(). Note that the shift must also be group-wise; a plain .shift(5) on the result would let windows bleed across group boundaries:
csum = df.groupby(level=0).cumsum()
result = csum - csum.groupby(level=0).shift(5, fill_value=0)
(Unlike rolling(window=5), this yields partial sums for the first four rows of each group rather than NaN.)
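As a sanity check, here is a sketch comparing the cumsum-difference approach against groupby().rolling().sum(). The sizes and seed are illustrative (shrunk from the question's setup for a quick run), and the key point is that the shift itself is done group-wise so windows never cross group boundaries:

```python
import numpy as np
import pandas as pd

# Small reproduction of the question's setup (sizes shrunk for a quick check;
# g, k, and window here are illustrative, not the original post's values).
obs_per_g, g, k, window = 20, 50, 3, 5
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(obs_per_g * g, k)),
    index=pd.MultiIndex.from_product([range(g), range(obs_per_g)]),
)

# Reference: the slow but straightforward group-wise rolling sum.
expected = df.groupby(level=0).rolling(window=window).sum()
# groupby().rolling() prepends the group key as an extra index level;
# drop it so the result lines up with the original index.
expected.index = expected.index.droplevel(0)

# Fast version: difference of group-wise cumulative sums. The shift is also
# group-wise, so no window straddles two groups.
csum = df.groupby(level=0).cumsum()
fast = csum - csum.groupby(level=0).shift(window, fill_value=0)

# The two agree wherever rolling() produces a value; rolling() emits NaN for
# the first window-1 rows of each group, where the fast version instead
# gives partial (expanding) sums.
match = np.allclose(
    fast.where(expected.notna()).to_numpy(),
    expected.to_numpy(),
    equal_nan=True,
)
```

The same idea extends to other invertible reductions (e.g. a rolling count), but not to operations like rolling max, which cannot be expressed as a difference of cumulative values.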
Source: https://stackoverflow.com/questions/56884977/speeding-up-rolling-sum-calculation-in-pandas-groupby