How to speed up creation of rolling sum (LTM) in pandas with large dataset?

Submitted by 爷,独闯天下 on 2019-12-13 03:24:17

Question


I want to calculate the moving sum (rolling twelve months, LTM) of daily sales for a dataset with 400k rows and 7 columns. My current approach appears to work but is slow: it takes between 1 and 2 minutes.

Columns include: date (daily entries), country, item name (product), customer city, customer number (ID), customer name and sales.

Other datasets I work with are much larger (2 million rows and more), so I would appreciate suggestions on how to speed up the current code:

import pandas as pd
import pyarrow.parquet as pq

# import dataset with 300k rows as pandas dataframe
df = pq.read_table('C:/test_cube_300k.parquet').to_pandas(strings_to_categorical=True)

# list for following groupby
list_groupby = [
    "country",
    "item_name",
    "customer_city",
    "customer_number",
    "customer_name"
    ]

# aggregate daily values to a monthly view; resample also adds months that are missing
# (e.g. January and March have entries but February does not)
df_ltm = df.set_index('date').groupby(list_groupby)["sales"].resample("M").sum()

df_ltm = df_ltm.reset_index()
df_ltm = df_ltm.set_index('date')
df_ltm.sort_index(inplace=True)

# rolling twelve-month sum per group (all key columns via groupby), window = 12 rows, min_periods = 12
# the data is already monthly after the resample above, so 12 rows span 12 months;
# freq is dropped here because newer pandas versions no longer accept it in rolling
df_ltm = df_ltm.groupby(list_groupby)['sales'].rolling(window=12, min_periods=12).sum().fillna(0)

df_ltm = df_ltm.reset_index().sort_index()
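
For anyone who wants to reproduce the rolling step without the parquet file, here is a minimal sketch on made-up data. Everything in it is an assumption for illustration: the toy values, the reduced two-column `keys` list, and the names `monthly`, `n_all` and `n_seen` are hypothetical and not part of my real dataset. It also shows one thing that may be worth checking, without any claim that it explains the 1 to 2 minutes: since the string columns are loaded as categoricals, `groupby` with `observed=False` builds a group for every combination of categories, while `observed=True` only keeps combinations that actually occur.

```python
import pandas as pd

# Hypothetical toy data: already aggregated to month ends, 14 months per group.
months = pd.date_range("2018-01-31", periods=14, freq=pd.offsets.MonthEnd())
monthly = pd.DataFrame({
    "country": pd.Categorical(["DE"] * 14 + ["FR"] * 14),
    "customer_city": pd.Categorical(["Berlin"] * 14 + ["Paris"] * 14),
    "date": list(months) * 2,
    "sales": [1.0] * 14 + [2.0] * 14,
})
keys = ["country", "customer_city"]  # reduced key list, for illustration only

# observed=False builds every category combination (2 x 2 = 4 groups),
# observed=True only the combinations present in the data (2 groups).
n_all = len(monthly.groupby(keys, observed=False)["sales"].sum())
n_seen = len(monthly.groupby(keys, observed=True)["sales"].sum())

# Rolling twelve-month sum per group. Rows are sorted by date within each
# group and there is one row per month, so a window of 12 rows equals 12 months.
ltm = (
    monthly.sort_values(keys + ["date"])
    .groupby(keys, observed=True)["sales"]
    .rolling(window=12, min_periods=12)
    .sum()
    .fillna(0)
    .reset_index()
)

print(n_all, n_seen)
print(ltm.tail(3))
```

On the real data the number of category combinations across five key columns could be far larger than the number of rows, so comparing the two group counts there might be a quick first check.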

Source: https://stackoverflow.com/questions/56138664/how-to-speed-up-creation-of-rolling-sum-ltm-in-pandas-with-large-dataset
