Question
I want to compute rolling sums group-wise for a large number of groups, and I'm having trouble doing it acceptably quickly.
Pandas has built-in methods for rolling and expanding calculations.
Here's an example:
import pandas as pd
import numpy as np
obs_per_g = 20
g = 10000
obs = g * obs_per_g
k = 20
df = pd.DataFrame(
data=np.random.normal(size=obs * k).reshape(obs, k),
index=pd.MultiIndex.from_product(iterables=[range(g), range(obs_per_g)]),
)
To get rolling and expanding sums I can use
df.groupby(level=0).expanding().sum()
df.groupby(level=0).rolling(window=5).sum()
But this takes a long time for a very large number of groups. For expanding sums, using the pandas method cumsum instead is almost 60 times faster (280 ms versus 16 s for the example above) and turns hours into minutes.
df.groupby(level=0).cumsum()
Is there a fast implementation of rolling sum in pandas, like cumsum is for expanding sums? If not, could I use numpy to accomplish this?
Answer 1:
I had the same experience with .rolling(): it's nice, but only for small datasets or when the function you're applying is non-standard. For sum(), I would suggest using cumsum() and subtracting a shifted cumsum(). Note that the shift must also be group-wise; a plain .shift(5) on the result would let windows bleed across group boundaries:
csum = df.groupby(level=0).cumsum()
result = csum - csum.groupby(level=0).shift(5, fill_value=0)
(Unlike rolling(window=5), this yields partial sums for the first four rows of each group rather than NaN.)
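As a sanity check, here is a sketch comparing the cumsum-difference approach against groupby().rolling().sum(). The sizes and seed are illustrative (shrunk from the question's setup for a quick run), and the key point is that the shift itself is done group-wise so windows never cross group boundaries:

```python
import numpy as np
import pandas as pd

# Small reproduction of the question's setup (sizes shrunk for a quick check;
# g, k, and window here are illustrative, not the original post's values).
obs_per_g, g, k, window = 20, 50, 3, 5
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(obs_per_g * g, k)),
    index=pd.MultiIndex.from_product([range(g), range(obs_per_g)]),
)

# Reference: the slow but straightforward group-wise rolling sum.
expected = df.groupby(level=0).rolling(window=window).sum()
# groupby().rolling() prepends the group key as an extra index level;
# drop it so the result lines up with the original index.
expected.index = expected.index.droplevel(0)

# Fast version: difference of group-wise cumulative sums. The shift is also
# group-wise, so no window straddles two groups.
csum = df.groupby(level=0).cumsum()
fast = csum - csum.groupby(level=0).shift(window, fill_value=0)

# The two agree wherever rolling() produces a value; rolling() emits NaN for
# the first window-1 rows of each group, where the fast version instead
# gives partial (expanding) sums.
match = np.allclose(
    fast.where(expected.notna()).to_numpy(),
    expected.to_numpy(),
    equal_nan=True,
)
```

The same idea extends to other invertible reductions (e.g. a rolling count), but not to operations like rolling max, which cannot be expressed as a difference of cumulative values.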
Source: https://stackoverflow.com/questions/56884977/speeding-up-rolling-sum-calculation-in-pandas-groupby