I have a dataframe which looks like this:
pd.DataFrame({\'category\': [1,1,1,2,2,2,3,3,3,4],
\'order_sta
"create a new column which contains the mean of the previous times of the same category" sounds like a good use case for GroupBy.expanding
(and a shift):
df['mean'] = (
df.groupby('category')['time'].apply(lambda x: x.shift().expanding().mean()))
df
category order_start time mean
0 1 1 1 NaN
1 1 2 4 1.0
2 1 3 3 2.5
3 2 1 6 NaN
4 2 2 8 6.0
5 2 3 17 7.0
6 3 1 14 NaN
7 3 2 12 14.0
8 3 3 13 13.0
9 4 1 16 NaN
Another way to calculate this is without the apply
(chaining two groupby
calls):
df['mean'] = (
df.groupby('category')['time']
.shift()
.groupby(df['category'])
.expanding()
.mean()
.to_numpy()) # replace to_numpy() with `.values` for pd.__version__ < 0.24
df
category order_start time mean
0 1 1 1 NaN
1 1 2 4 1.0
2 1 3 3 2.5
3 2 1 6 NaN
4 2 2 8 6.0
5 2 3 17 7.0
6 3 1 14 NaN
7 3 2 12 14.0
8 3 3 13 13.0
9 4 1 16 NaN
In terms of performance, it really depends on the number and size of your groups.
Inspired by my answer here, one can define a function first:
def mean_previous(df, Category, Order, Var):
# Order the dataframe first
df.sort_values([Category, Order], inplace=True)
# Calculate the ordinary grouped cumulative sum
# and then substract with the grouped cumulative sum of the last order
csp = df.groupby(Category)[Var].cumsum() - df.groupby([Category, Order])[Var].cumsum()
# Calculate the ordinary grouped cumulative count
# and then substract with the grouped cumulative count of the last order
ccp = df.groupby(Category)[Var].cumcount() - df.groupby([Category, Order]).cumcount()
return csp / ccp
And the desired column is
df['mean'] = mean_previous(df, 'category', 'order_start', 'time')
Performance-wise, I believe it's very fast.