Apply multiple functions to multiple groupby columns

前端 未结 7 2087
春和景丽
春和景丽 2020-11-22 03:16

The docs show how to apply multiple functions on a groupby object at a time using a dict with the output column names as the keys:

In [563]: grouped[\'D\'].a         


        
相关标签:
7条回答
  • 2020-11-22 03:57

    The second half of the currently accepted answer is outdated and has two deprecations. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix.

    If you desire to work with two separate columns at the same time I would suggest using the apply method which implicitly passes a DataFrame to the applied function. Let's use a similar dataframe as the one from above

    df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
    df['group'] = [0, 0, 1, 1]
    df
    
              a         b         c         d  group
    0  0.418500  0.030955  0.874869  0.145641      0
    1  0.446069  0.901153  0.095052  0.487040      0
    2  0.843026  0.936169  0.926090  0.041722      1
    3  0.635846  0.439175  0.828787  0.714123      1
    

    A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.

    df.groupby('group').agg({'a':['sum', 'max'], 
                             'b':'mean', 
                             'c':'sum', 
                             'd': lambda x: x.max() - x.min()})
    
                  a                   b         c         d
                sum       max      mean       sum  <lambda>
    group                                                  
    0      0.864569  0.446069  0.466054  0.969921  0.341399
    1      1.478872  0.843026  0.687672  1.754877  0.672401
    

    If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:

    def max_min(x):
        return x.max() - x.min()
    
    max_min.__name__ = 'Max minus Min'
    
    df.groupby('group').agg({'a':['sum', 'max'], 
                             'b':'mean', 
                             'c':'sum', 
                             'd': max_min})
    
                  a                   b         c             d
                sum       max      mean       sum Max minus Min
    group                                                      
    0      0.864569  0.446069  0.466054  0.969921      0.341399
    1      1.478872  0.843026  0.687672  1.754877      0.672401
    

    Using apply and returning a Series

    Now, if you had multiple columns that needed to interact together then you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply the entire group as a DataFrame gets passed into the function.

    I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:

    def f(x):
        d = {}
        d['a_sum'] = x['a'].sum()
        d['a_max'] = x['a'].max()
        d['b_mean'] = x['b'].mean()
        d['c_d_prodsum'] = (x['c'] * x['d']).sum()
        return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])
    
    df.groupby('group').apply(f)
    
             a_sum     a_max    b_mean  c_d_prodsum
    group                                           
    0      0.864569  0.446069  0.466054     0.173711
    1      1.478872  0.843026  0.687672     0.630494
    

    If you are in love with MultiIndexes, you can still return a Series with one like this:

        def f_mi(x):
            d = []
            d.append(x['a'].sum())
            d.append(x['a'].max())
            d.append(x['b'].mean())
            d.append((x['c'] * x['d']).sum())
            return pd.Series(d, index=[['a', 'a', 'b', 'c_d'], 
                                       ['sum', 'max', 'mean', 'prodsum']])
    
    df.groupby('group').apply(f_mi)
    
                  a                   b       c_d
                sum       max      mean   prodsum
    group                                        
    0      0.864569  0.446069  0.466054  0.173711
    1      1.478872  0.843026  0.687672  0.630494
    
    0 讨论(0)
  • 2020-11-22 04:01

    Pandas >= 0.25.0, named aggregations

    Since pandas version 0.25.0 or higher, we are moving away from the dictionary based aggregation and renaming, and moving towards named aggregations which accepts a tuple. Now we can simultaneously aggregate + rename to a more informative column name:

    Example:

    df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
    df['group'] = [0, 0, 1, 1]
    
              a         b         c         d  group
    0  0.521279  0.914988  0.054057  0.125668      0
    1  0.426058  0.828890  0.784093  0.446211      0
    2  0.363136  0.843751  0.184967  0.467351      1
    3  0.241012  0.470053  0.358018  0.525032      1
    

    Apply GroupBy.agg with named aggregation:

    df.groupby('group').agg(
                 a_sum=('a', 'sum'),
                 a_mean=('a', 'mean'),
                 b_mean=('b', 'mean'),
                 c_sum=('c', 'sum'),
                 d_range=('d', lambda x: x.max() - x.min())
    )
    
              a_sum    a_mean    b_mean     c_sum   d_range
    group                                                  
    0      0.947337  0.473668  0.871939  0.838150  0.320543
    1      0.604149  0.302074  0.656902  0.542985  0.057681
    
    0 讨论(0)
  • 2020-11-22 04:03

    For the first part you can pass a dict of column names for keys and a list of functions for the values:

    In [28]: df
    Out[28]:
              A         B         C         D         E  GRP
    0  0.395670  0.219560  0.600644  0.613445  0.242893    0
    1  0.323911  0.464584  0.107215  0.204072  0.927325    0
    2  0.321358  0.076037  0.166946  0.439661  0.914612    1
    3  0.133466  0.447946  0.014815  0.130781  0.268290    1
    
    In [26]: f = {'A':['sum','mean'], 'B':['prod']}
    
    In [27]: df.groupby('GRP').agg(f)
    Out[27]:
                A                   B
              sum      mean      prod
    GRP
    0    0.719580  0.359790  0.102004
    1    0.454824  0.227412  0.034060
    

    UPDATE 1:

    Because the aggregate function works on Series, references to the other column names are lost. To get around this, you can reference the full dataframe and index it using the group indices within the lambda function.

    Here's a hacky workaround:

    In [67]: f = {'A':['sum','mean'], 'B':['prod'], 'D': lambda g: df.loc[g.index].E.sum()}
    
    In [69]: df.groupby('GRP').agg(f)
    Out[69]:
                A                   B         D
              sum      mean      prod  <lambda>
    GRP
    0    0.719580  0.359790  0.102004  1.170219
    1    0.454824  0.227412  0.034060  1.182901
    

    Here, the resultant 'D' column is made up of the summed 'E' values.

    UPDATE 2:

    Here's a method that I think will do everything you ask. First make a custom lambda function. Below, g references the group. When aggregating, g will be a Series. Passing g.index to df.ix[] selects the current group from df. I then test if column C is less than 0.5. The returned boolean series is passed to g[] which selects only those rows meeting the criteria.

    In [95]: cust = lambda g: g[df.loc[g.index]['C'] < 0.5].sum()
    
    In [96]: f = {'A':['sum','mean'], 'B':['prod'], 'D': {'my name': cust}}
    
    In [97]: df.groupby('GRP').agg(f)
    Out[97]:
                A                   B         D
              sum      mean      prod   my name
    GRP
    0    0.719580  0.359790  0.102004  0.204072
    1    0.454824  0.227412  0.034060  0.570441
    
    0 讨论(0)
  • 2020-11-22 04:08

    Ted's answer is amazing. I ended up using a smaller version of that in case anyone is interested. Useful when you are looking for one aggregation that depends on values from multiple columns:

    create a dataframe

    df=pd.DataFrame({'a': [1,2,3,4,5,6], 'b': [1,1,0,1,1,0], 'c': ['x','x','y','y','z','z']})
    
    
       a  b  c
    0  1  1  x
    1  2  1  x
    2  3  0  y
    3  4  1  y
    4  5  1  z
    5  6  0  z
    

    grouping and aggregating with apply (using multiple columns)

    df.groupby('c').apply(lambda x: x['a'][(x['a']>1) & (x['b']==1)].mean())
    
    c
    x    2.0
    y    4.0
    z    5.0
    

    grouping and aggregating with aggregate (using multiple columns)

    I like this approach since I can still use aggregate. Perhaps people will let me know why apply is needed for getting at multiple columns when doing aggregations on groups.

    It seems obvious now, but as long as you don't select the column of interest directly after the groupby, you will have access to all the columns of the dataframe from within your aggregation function.

    only access to the selected column

    df.groupby('c')['a'].aggregate(lambda x: x[x>1].mean())
    

    access to all columns since selection is after all the magic

    df.groupby('c').aggregate(lambda x: x[(x['a']>1) & (x['b']==1)].mean())['a']
    

    or similarly

    df.groupby('c').aggregate(lambda x: x['a'][(x['a']>1) & (x['b']==1)].mean())
    

    I hope this helps.

    0 讨论(0)
  • 2020-11-22 04:18

    As an alternative (mostly on aesthetics) to Ted Petrou's answer, I found I preferred a slightly more compact listing. Please don't consider accepting it, it's just a much-more-detailed comment on Ted's answer, plus code/data. Python/pandas is not my first/best, but I found this to read well:

    df.groupby('group') \
      .apply(lambda x: pd.Series({
          'a_sum'       : x['a'].sum(),
          'a_max'       : x['a'].max(),
          'b_mean'      : x['b'].mean(),
          'c_d_prodsum' : (x['c'] * x['d']).sum()
      })
    )
    
              a_sum     a_max    b_mean  c_d_prodsum
    group                                           
    0      0.530559  0.374540  0.553354     0.488525
    1      1.433558  0.832443  0.460206     0.053313
    

    I find it more reminiscent of dplyr pipes and data.table chained commands. Not to say they're better, just more familiar to me. (I certainly recognize the power and, for many, the preference of using more formalized def functions for these types of operations. This is just an alternative, not necessarily better.)


    I generated data in the same manner as Ted, I'll add a seed for reproducibility.

    import numpy as np
    np.random.seed(42)
    df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
    df['group'] = [0, 0, 1, 1]
    df
    
              a         b         c         d  group
    0  0.374540  0.950714  0.731994  0.598658      0
    1  0.156019  0.155995  0.058084  0.866176      0
    2  0.601115  0.708073  0.020584  0.969910      1
    3  0.832443  0.212339  0.181825  0.183405      1
    
    0 讨论(0)
  • 2020-11-22 04:19

    New in version 0.25.0.

    To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where

    • The keywords are the output column names
    • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.
        In [79]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
           ....:                         'height': [9.1, 6.0, 9.5, 34.0],
           ....:                         'weight': [7.9, 7.5, 9.9, 198.0]})
           ....: 
    
        In [80]: animals
        Out[80]: 
          kind  height  weight
        0  cat     9.1     7.9
        1  dog     6.0     7.5
        2  cat     9.5     9.9
        3  dog    34.0   198.0
    
        In [81]: animals.groupby("kind").agg(
           ....:     min_height=pd.NamedAgg(column='height', aggfunc='min'),
           ....:     max_height=pd.NamedAgg(column='height', aggfunc='max'),
           ....:     average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
           ....: )
           ....: 
        Out[81]: 
              min_height  max_height  average_weight
        kind                                        
        cat          9.1         9.5            8.90
        dog          6.0        34.0          102.75
    

    pandas.NamedAgg is just a namedtuple. Plain tuples are allowed as well.

        In [82]: animals.groupby("kind").agg(
           ....:     min_height=('height', 'min'),
           ....:     max_height=('height', 'max'),
           ....:     average_weight=('weight', np.mean),
           ....: )
           ....: 
        Out[82]: 
              min_height  max_height  average_weight
        kind                                        
        cat          9.1         9.5            8.90
        dog          6.0        34.0          102.75
    

    Additional keyword arguments are not passed through to the aggregation functions. Only pairs of (column, aggfunc) should be passed as **kwargs. If your aggregation functions requires additional arguments, partially apply them with functools.partial().

    Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.

        In [84]: animals.groupby("kind").height.agg(
           ....:     min_height='min',
           ....:     max_height='max',
           ....: )
           ....: 
        Out[84]: 
              min_height  max_height
        kind                        
        cat          9.1         9.5
        dog          6.0        34.0
    
    0 讨论(0)
提交回复
热议问题