pandas GroupBy and cumulative mean of previous rows in group

太阳男子 2021-01-13 23:12

I have a dataframe which looks like this:

pd.DataFrame({'category': [1,1,1,2,2,2,3,3,3,4],
              'order_sta


        
2 Answers
  • 2021-01-13 23:16

    "create a new column which contains the mean of the previous times of the same category" sounds like a good use case for GroupBy.expanding (and a shift):

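    # transform keeps df's original index; in newer pandas, apply can
    # prepend the group key to the index and break the assignment
    df['mean'] = (
        df.groupby('category')['time']
          .transform(lambda x: x.shift().expanding().mean()))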
    df
       category  order_start  time  mean
    0         1            1     1   NaN
    1         1            2     4   1.0
    2         1            3     3   2.5
    3         2            1     6   NaN
    4         2            2     8   6.0
    5         2            3    17   7.0
    6         3            1    14   NaN
    7         3            2    12  14.0
    8         3            3    13  13.0
    9         4            1    16   NaN
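
To see why the shift is needed, it may help to trace a single group by hand. The sketch below takes the `time` values of category 1 from the output above and walks through each intermediate step; the variable names are illustrative only.

```python
import pandas as pd

# `time` values for category 1, taken from the output above
times = pd.Series([1, 4, 3])

# shift() moves every value down one row, so row i only sees earlier rows
shifted = times.shift()              # NaN, 1.0, 4.0

# expanding().mean() averages everything seen so far, skipping NaN
prev_mean = shifted.expanding().mean()

print(prev_mean.tolist())            # [nan, 1.0, 2.5]
```

Without the shift, each row's own value would be included in its average, which is the plain expanding mean rather than the mean of the previous rows.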
    

    Another way to calculate this avoids the Python-level lambda by chaining two groupby calls:

    df['mean'] = (
        df.groupby('category')['time']
          .shift()
          .groupby(df['category'])
          .expanding()
          .mean()
          .droplevel(0))  # drop the category level so the result aligns on df's index
                          # (for pd.__version__ < 0.24 use .reset_index(level=0, drop=True))
    df
       category  order_start  time  mean
    0         1            1     1   NaN
    1         1            2     4   1.0
    2         1            3     3   2.5
    3         2            1     6   NaN
    4         2            2     8   6.0
    5         2            3    17   7.0
    6         3            1    14   NaN
    7         3            2    12  14.0
    8         3            3    13  13.0
    9         4            1    16   NaN
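
One caveat when assigning the chained result positionally (for example through `to_numpy()`): the grouped expanding result comes back ordered by category, so positions line up with `df` only when its rows are already sorted that way. Below is a minimal sketch, assuming a shuffled copy of the example frame (the name `shuffled` is hypothetical), that drops the group level and lets pandas align on the original index instead.

```python
import pandas as pd

# Example frame, reconstructed from the printed output above
df = pd.DataFrame({
    'category':    [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
    'order_start': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
    'time':        [1, 4, 3, 6, 8, 17, 14, 12, 13, 16],
})

# Interleave the categories so rows are no longer grouped together
shuffled = df.iloc[[0, 3, 6, 9, 1, 4, 7, 2, 5, 8]].copy()

prev_mean = (
    shuffled.groupby('category')['time']
            .shift()
            .groupby(shuffled['category'])
            .expanding()
            .mean()
            .droplevel(0))   # drop the category level, keep the original row labels

# Assignment aligns on the index, so row order no longer matters
shuffled['mean'] = prev_mean
print(shuffled.sort_index())
```

This relies on the index being unique; with duplicate row labels, index alignment would be ambiguous.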
    

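    In terms of performance, it really depends on the number and size of your groups: the lambda-based version runs a Python function once per group, so its overhead grows with the number of groups.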
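
Since the trade-off depends on the group layout, a rough way to find out is to time both variants on data shaped like yours. The sketch below uses synthetic data; `make_frame`, `with_lambda`, `chained`, and the sizes are all hypothetical. It uses `transform` rather than `apply` for the lambda variant so that the result keeps the original index across pandas versions.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_groups, rows_per_group, seed=0):
    """Build a synthetic frame, already sorted by category (an assumption)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'category': np.repeat(np.arange(n_groups), rows_per_group),
        'time': rng.integers(1, 20, size=n_groups * rows_per_group),
    })


def with_lambda(df):
    # One Python-level call per group
    return df.groupby('category')['time'].transform(
        lambda x: x.shift().expanding().mean())


def chained(df):
    # Two grouped passes, no Python-level lambda
    return (df.groupby('category')['time'].shift()
              .groupby(df['category']).expanding().mean()
              .droplevel(0))


# Compare a few large groups against many small groups
for n_groups, rows in [(10, 1000), (1000, 10)]:
    frame = make_frame(n_groups, rows)
    t_lambda = timeit.timeit(lambda: with_lambda(frame), number=3)
    t_chained = timeit.timeit(lambda: chained(frame), number=3)
    print(f'{n_groups} groups x {rows} rows: '
          f'lambda {t_lambda:.3f}s, chained {t_chained:.3f}s')
```

The timings will vary with the machine and the pandas version, so treat them as a relative comparison only.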

  • 2021-01-13 23:37

    Inspired by my answer here, one can define a function first:

    def mean_previous(df, Category, Order, Var):
        # Work on a sorted copy so the caller's dataframe is left untouched
        df = df.sort_values([Category, Order])

        # Grouped cumulative sum within the category, minus the cumulative
        # sum within the same (category, order): the sum over earlier orders
        csp = df.groupby(Category)[Var].cumsum() - df.groupby([Category, Order])[Var].cumsum()

        # Same subtraction with cumulative counts: the number of earlier rows
        ccp = df.groupby(Category)[Var].cumcount() - df.groupby([Category, Order]).cumcount()

        # 0 / 0 yields NaN for rows with nothing before them
        return csp / ccp
    

    And the desired column is

    df['mean'] = mean_previous(df, 'category', 'order_start', 'time')
    

    Performance-wise, this should be fast: it uses only vectorized grouped cumulative operations, with no Python-level function called per group.
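
The subtraction against the `[Category, Order]` groups matters when several rows share the same order value: those rows are then excluded from each other's average. Here is a small sketch of that arithmetic on made-up data (the frame `ties` and its values are hypothetical, not taken from the question).

```python
import pandas as pd

# Hypothetical data: two rows of category 1 share order_start == 1
ties = pd.DataFrame({
    'category':    [1, 1, 1],
    'order_start': [1, 1, 2],
    'time':        [1, 4, 3],
})

by_cat = ties.groupby('category')['time']
by_cat_order = ties.groupby(['category', 'order_start'])['time']

# Sum and count of rows with a strictly earlier order within the category
prev_sum = by_cat.cumsum() - by_cat_order.cumsum()        # 0, 0, 5
prev_count = by_cat.cumcount() - by_cat_order.cumcount()  # 0, 0, 2

# 0 / 0 gives NaN, which marks rows that have no earlier order
ties['mean'] = prev_sum / prev_count
print(ties['mean'].tolist())   # [nan, nan, 2.5]
```

The shift-and-expanding approach treats rows purely by position within the group, so it would give the second tied row a mean of 1.0 instead of NaN; which behavior is wanted depends on the data.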
