Aggregation in pandas

Asked 2020-11-22 08:16
  1. How to perform aggregation with pandas?
  2. No DataFrame after aggregation! What happened?
  3. How to aggregate mainly strings columns (to lists, tuples, strings with separator)?
2 Answers
  • 2020-11-22 08:39

    If you are coming from an R or SQL background, here are three examples that will teach you everything you need to do aggregation the way you are already familiar with:

    Let us first create a pandas DataFrame:

    import pandas as pd
    
    df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
                       'key2' : ['c','c','d','d','e'],
                       'value1' : [1,2,2,3,3],
                       'value2' : [9,8,7,6,5]})
    
    df.head(5)
    

    Here is what the table we created looks like:

    |----------------|-------------|------------|------------|
    |      key1      |     key2    |    value1  |    value2  |
    |----------------|-------------|------------|------------|
    |       a        |       c     |      1     |       9    |
    |       a        |       c     |      2     |       8    |
    |       a        |       d     |      2     |       7    |
    |       b        |       d     |      3     |       6    |
    |       a        |       e     |      3     |       5    |
    |----------------|-------------|------------|------------|
    

    1. Aggregating With Row Reduction Similar to SQL Group By

    df_agg = df.groupby(['key1','key2']).agg(mean_of_value_1=('value1', 'mean'), 
                                             sum_of_value_2=('value2', 'sum'),
                                             count_of_value1=('value1','size')
                                             ).reset_index()
    
    
    df_agg.head(5)
    

    The resulting data table will look like this:

    |----------------|-------------|--------------------|-------------------|---------------------|
    |      key1      |     key2    |   mean_of_value_1  |   sum_of_value_2  |    count_of_value1  |
    |----------------|-------------|--------------------|-------------------|---------------------|
    |       a        |      c      |         1.5        |        17         |           2         |
    |       a        |      d      |         2.0        |         7         |           1         |
    |       a        |      e      |         3.0        |         5         |           1         |
    |       b        |      d      |         3.0        |         6         |           1         |
    |----------------|-------------|--------------------|-------------------|---------------------|
    

    The SQL equivalent of this is:

    SELECT
          key1
         ,key2
         ,AVG(value1) AS mean_of_value_1
         ,SUM(value2) AS sum_of_value_2
         ,COUNT(*) AS count_of_value1
    FROM
        df
    GROUP BY
         key1
        ,key2
    

    2. Create Column Without Reduction in Rows (EXCEL - SUMIF, COUNTIF)

    If you want to do a SUMIF, COUNTIF, etc., as you would in Excel, where there is no reduction in rows, then you need to do this instead:

    df['Total_of_value1_by_key1'] = df.groupby('key1')['value1'].transform('sum')
    
    df.head(5)
    

    The resulting data frame will look like this, with the same number of rows as the original:

    |----------------|-------------|------------|------------|-------------------------|
    |      key1      |     key2    |    value1  |    value2  | Total_of_value1_by_key1 |
    |----------------|-------------|------------|------------|-------------------------|
    |       a        |       c     |      1     |       9    |            8            |
    |       a        |       c     |      2     |       8    |            8            |
    |       a        |       d     |      2     |       7    |            8            |
    |       b        |       d     |      3     |       6    |            3            |
    |       a        |       e     |      3     |       5    |            8            |
    |----------------|-------------|------------|------------|-------------------------|
    

    3. Creating a RANK Column (ROW_NUMBER() OVER (PARTITION BY ORDER BY))

    Finally, there might be cases where you want to create a rank column, the SQL equivalent of ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY value1 DESC, value2 ASC).

    Here is how you do that:

     df['RN'] = df.sort_values(['value1','value2'], ascending=[False,True]) \
                  .groupby(['key1']) \
                  .cumcount() + 1
    
     df.head(5) 
    

    Note: we make the code multi-line by adding \ at the end of each line.

    Here is what the resulting data frame looks like:

    |----------------|-------------|------------|------------|------------|
    |      key1      |     key2    |    value1  |    value2  |     RN     |
    |----------------|-------------|------------|------------|------------|
    |       a        |       c     |      1     |       9    |      4     |
    |       a        |       c     |      2     |       8    |      3     |
    |       a        |       d     |      2     |       7    |      2     |
    |       b        |       d     |      3     |       6    |      1     |
    |       a        |       e     |      3     |       5    |      1     |
    |----------------|-------------|------------|------------|------------|
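
    As a side note (not part of the original answer), the same chain can be wrapped in parentheses instead of using trailing backslashes, which avoids continuation characters entirely:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'key2': ['c', 'c', 'd', 'd', 'e'],
                   'value1': [1, 2, 2, 3, 3],
                   'value2': [9, 8, 7, 6, 5]})

# Wrapping the whole expression in parentheses lets it span
# multiple lines without any backslash continuations.
df['RN'] = (df.sort_values(['value1', 'value2'], ascending=[False, True])
              .groupby('key1')
              .cumcount() + 1)

print(df)
```

    The result aligns back to the original row order by index, so the RN column is identical to the backslash version.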
    

    In all the examples above, the final result is a flat table; it will not have the pivot (wide) structure that you might get with other syntaxes.
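
    For contrast, here is a quick sketch (my addition, using the same df as the first example) of what a pivot-style result looks like, produced with pivot_table instead of groupby:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'key2': ['c', 'c', 'd', 'd', 'e'],
                   'value1': [1, 2, 2, 3, 3],
                   'value2': [9, 8, 7, 6, 5]})

# pivot_table spreads key2 across the columns instead of keeping
# one row per (key1, key2) pair; missing combinations become NaN.
wide = df.pivot_table(index='key1', columns='key2',
                      values='value1', aggfunc='sum')
print(wide)
# row 'a': c=3, d=2, e=3; row 'b': only d=3, the rest NaN
```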

    Other aggregating operators:

    mean() Compute mean of groups

    sum() Compute sum of group values

    size() Compute group sizes

    count() Compute count of group

    std() Standard deviation of groups

    var() Compute variance of groups

    sem() Standard error of the mean of groups

    describe() Generates descriptive statistics

    first() Compute first of group values

    last() Compute last of group values

    nth() Take nth value, or a subset if n is a list

    min() Compute min of group values

    max() Compute max of group values
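
    Several of these operators can be combined in a single .agg call; a small sketch using the df from the start of this answer:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'key2': ['c', 'c', 'd', 'd', 'e'],
                   'value1': [1, 2, 2, 3, 3],
                   'value2': [9, 8, 7, 6, 5]})

# Apply min, max and mean to value1 for each key1 group at once;
# the result has one column per function.
stats = df.groupby('key1')['value1'].agg(['min', 'max', 'mean'])
print(stats)
```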

    Hope this helps.

  • 2020-11-22 08:41

    Question 1

    How to perform aggregation with pandas?

    Expanded aggregation documentation.

    Aggregating functions are the ones that reduce the dimension of the returned object: the output Series/DataFrame has the same number of rows as the original or fewer. Some common aggregating functions are tabulated below:

    Function    Description
    mean()      Compute mean of groups
    sum()       Compute sum of group values
    size()      Compute group sizes
    count()     Compute count of group
    std()       Standard deviation of groups
    var()       Compute variance of groups
    sem()       Standard error of the mean of groups
    describe()  Generates descriptive statistics
    first()     Compute first of group values
    last()      Compute last of group values
    nth()       Take nth value, or a subset if n is a list
    min()       Compute min of group values
    max()       Compute max of group values
    
    np.random.seed(123)
    
    df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                       'B' : ['one', 'two', 'three','two', 'two', 'one'],
                       'C' : np.random.randint(5, size=6),
                       'D' : np.random.randint(5, size=6),
                       'E' : np.random.randint(5, size=6)})
    print (df)
         A      B  C  D  E
    0  foo    one  2  3  0
    1  foo    two  4  1  0
    2  bar  three  2  1  1
    3  foo    two  1  0  3
    4  bar    two  3  1  4
    5  foo    one  2  1  0
    

    Aggregation on a selected column with Cython-implemented functions:

    df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
    print (df1)
         A      B  C
    0  bar  three  2
    1  bar    two  3
    2  foo    one  4
    3  foo    two  5
    

    If no column is selected after groupby, the aggregate function is applied to all non-grouping columns (here, everything except A and B):

    df2 = df.groupby(['A', 'B'], as_index=False).sum()
    print (df2)
         A      B  C  D  E
    0  bar  three  2  1  1
    1  bar    two  3  1  4
    2  foo    one  4  4  0
    3  foo    two  5  1  3
    

    You can also select only some columns for aggregation by passing a list after groupby:

    df3 = df.groupby(['A', 'B'], as_index=False)[['C', 'D']].sum()
    print (df3)
         A      B  C  D
    0  bar  three  2  1
    1  bar    two  3  1
    2  foo    one  4  4
    3  foo    two  5  1
    

    The same results using the function DataFrameGroupBy.agg:

    df1 = df.groupby(['A', 'B'], as_index=False)['C'].agg('sum')
    print (df1)
         A      B  C
    0  bar  three  2
    1  bar    two  3
    2  foo    one  4
    3  foo    two  5
    
    df2 = df.groupby(['A', 'B'], as_index=False).agg('sum')
    print (df2)
         A      B  C  D  E
    0  bar  three  2  1  1
    1  bar    two  3  1  4
    2  foo    one  4  4  0
    3  foo    two  5  1  3
    

    To apply multiple functions to one column, use a list of tuples of (new column name, aggregating function):

    df4 = (df.groupby(['A', 'B'])['C']
             .agg([('average','mean'),('total','sum')])
             .reset_index())
    print (df4)
         A      B  average  total
    0  bar  three      2.0      2
    1  bar    two      3.0      3
    2  foo    one      2.0      4
    3  foo    two      2.5      5
    

    To apply multiple functions to all columns, pass the same list of tuples without selecting a column:

    df5 = (df.groupby(['A', 'B'])
             .agg([('average','mean'),('total','sum')]))
    
    print (df5)
                    C             D             E      
              average total average total average total
    A   B                                              
    bar three     2.0     2     1.0     1     1.0     1
        two       3.0     3     1.0     1     4.0     4
    foo one       2.0     4     2.0     4     0.0     0
        two       2.5     5     0.5     1     1.5     3
    
        
    

    This produces a MultiIndex in the columns:

    print (df5.columns)
    MultiIndex(levels=[['C', 'D', 'E'], ['average', 'total']],
               labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
               
    

    To flatten the MultiIndex into single-level columns, use map with join:

    df5.columns = df5.columns.map('_'.join)
    df5 = df5.reset_index()
    print (df5)
         A      B  C_average  C_total  D_average  D_total  E_average  E_total
    0  bar  three        2.0        2        1.0        1        1.0        1
    1  bar    two        3.0        3        1.0        1        4.0        4
    2  foo    one        2.0        4        2.0        4        0.0        0
    3  foo    two        2.5        5        0.5        1        1.5        3
    

    Another solution is to pass a list of aggregating functions, then flatten the MultiIndex and rename the columns with str.replace:

    df5 = df.groupby(['A', 'B']).agg(['mean','sum'])
        
    df5.columns = (df5.columns.map('_'.join)
                      .str.replace('sum','total')
                      .str.replace('mean','average'))
    df5 = df5.reset_index()
    print (df5)
         A      B  C_average  C_total  D_average  D_total  E_average  E_total
    0  bar  three        2.0        2        1.0        1        1.0        1
    1  bar    two        3.0        3        1.0        1        4.0        4
    2  foo    one        2.0        4        2.0        4        0.0        0
    3  foo    two        2.5        5        0.5        1        1.5        3
    

    To specify a different aggregating function for each column, pass a dictionary:

    df6 = (df.groupby(['A', 'B'], as_index=False)
             .agg({'C':'sum','D':'mean'})
             .rename(columns={'C':'C_total', 'D':'D_average'}))
    print (df6)
         A      B  C_total  D_average
    0  bar  three        2        1.0
    1  bar    two        3        1.0
    2  foo    one        4        2.0
    3  foo    two        5        0.5
    

    You can pass a custom function too:

    def func(x):
        return x.iat[0] + x.iat[-1]
    
    df7 = (df.groupby(['A', 'B'], as_index=False)
             .agg({'C':'sum','D': func})
             .rename(columns={'C':'C_total', 'D':'D_sum_first_and_last'}))
    print (df7)
         A      B  C_total  D_sum_first_and_last
    0  bar  three        2                     2
    1  bar    two        3                     2
    2  foo    one        4                     4
    3  foo    two        5                     1
    

    Question 2

    No DataFrame after aggregation! What happened?

    Aggregation by 2 or more columns:

    df1 = df.groupby(['A', 'B'])['C'].sum()
    print (df1)
    A    B    
    bar  three    2
         two      3
    foo  one      4
         two      5
    Name: C, dtype: int32
    

    First, check the Index and the type of the pandas object:

    print (df1.index)
    MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']],
               labels=[[0, 0, 1, 1], [1, 2, 0, 2]],
               names=['A', 'B'])
    
    print (type(df1))
    <class 'pandas.core.series.Series'>
    

    There are two solutions for getting the MultiIndex Series into columns:

    • add the parameter as_index=False:
    df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
    print (df1)
         A      B  C
    0  bar  three  2
    1  bar    two  3
    2  foo    one  4
    3  foo    two  5
    
    • use Series.reset_index:
    df1 = df.groupby(['A', 'B'])['C'].sum().reset_index()
    print (df1)
         A      B  C
    0  bar  three  2
    1  bar    two  3
    2  foo    one  4
    3  foo    two  5
    

    If you group by one column:

    df2 = df.groupby('A')['C'].sum()
    print (df2)
    A
    bar    5
    foo    9
    Name: C, dtype: int32
    

    ... you get a Series with an Index:

    print (df2.index)
    Index(['bar', 'foo'], dtype='object', name='A')
    
    print (type(df2))
    <class 'pandas.core.series.Series'>
    

    And the solution is the same as for the MultiIndex Series:

    df2 = df.groupby('A', as_index=False)['C'].sum()
    print (df2)
         A  C
    0  bar  5
    1  foo  9
    
    df2 = df.groupby('A')['C'].sum().reset_index()
    print (df2)
         A  C
    0  bar  5
    1  foo  9
    

    Question 3

    How to aggregate mainly strings columns (to lists, tuples, strings with separator)?

    df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                       'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                       'C' : ['three', 'one', 'two', 'two', 'three','two', 'one'],
                       'D' : [1,2,3,2,3,1,2]})
    print (df)
       A      B      C  D
    0  a    one  three  1
    1  c    two    one  2
    2  b  three    two  3
    3  b    two    two  2
    4  a    two  three  3
    5  c    one    two  1
    6  b  three    one  2
    

    Instead of an aggregation function, it is possible to pass list, tuple, or set to convert the column:

    df1 = df.groupby('A')['B'].agg(list).reset_index()
    print (df1)
       A                    B
    0  a           [one, two]
    1  b  [three, two, three]
    2  c           [two, one]
    

    An alternative is to use GroupBy.apply:

    df1 = df.groupby('A')['B'].apply(list).reset_index()
    print (df1)
       A                    B
    0  a           [one, two]
    1  b  [three, two, three]
    2  c           [two, one]
    

    For converting to strings with a separator, use .join (only for string columns):

    df2 = df.groupby('A')['B'].agg(','.join).reset_index()
    print (df2)
       A                B
    0  a          one,two
    1  b  three,two,three
    2  c          two,one
    

    For a numeric column, use a lambda function with astype to convert the values to strings:

    df3 = (df.groupby('A')['D']
             .agg(lambda x: ','.join(x.astype(str)))
             .reset_index())
    print (df3)
       A      D
    0  a    1,3
    1  b  3,2,2
    2  c    2,1
    

    Another solution is to convert to strings before groupby:

    df3 = (df.assign(D = df['D'].astype(str))
             .groupby('A')['D']
             .agg(','.join).reset_index())
    print (df3)
       A      D
    0  a    1,3
    1  b  3,2,2
    2  c    2,1
    

    For converting all columns, pass no list of columns after groupby. There is no column D in the output because of the automatic exclusion of 'nuisance' columns: numeric columns, on which ','.join fails, are dropped.

    df4 = df.groupby('A').agg(','.join).reset_index()
    print (df4)
       A                B            C
    0  a          one,two  three,three
    1  b  three,two,three  two,two,one
    2  c          two,one      one,two
    

    So it is necessary to convert all columns to strings first; then all columns appear in the output:

    df5 = (df.groupby('A')
             .agg(lambda x: ','.join(x.astype(str)))
             .reset_index())
    print (df5)
       A                B            C      D
    0  a          one,two  three,three    1,3
    1  b  three,two,three  two,two,one  3,2,2
    2  c          two,one      one,two    2,1
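
    The question also mentions tuples; the same pattern works with tuple in place of list (a small extra sketch, reusing the df from this question):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B': ['one', 'two', 'three', 'two', 'two', 'one', 'three']})

# Collect each group's B values into a tuple instead of a list.
df1 = df.groupby('A')['B'].agg(tuple).reset_index()
print(df1)
```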
    

    Question 4

    How to aggregate counts?

    df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                       'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                       'C' : ['three', np.nan, np.nan, 'two', 'three','two', 'one'],
                       'D' : [np.nan,2,3,2,3,np.nan,2]})
    print (df)
       A      B      C    D
    0  a    one  three  NaN
    1  c    two    NaN  2.0
    2  b  three    NaN  3.0
    3  b    two    two  2.0
    4  a    two  three  3.0
    5  c    one    two  NaN
    6  b  three    one  2.0
    

    Use the function GroupBy.size for the size of each group:

    df1 = df.groupby('A').size().reset_index(name='COUNT')
    print (df1)
       A  COUNT
    0  a      2
    1  b      3
    2  c      2
    

    The function GroupBy.count excludes missing values:

    df2 = df.groupby('A')['C'].count().reset_index(name='COUNT')
    print (df2)
       A  COUNT
    0  a      2
    1  b      2
    2  c      1
    

    This function can be used for multiple columns to count non-missing values:

    df3 = df.groupby('A').count().add_suffix('_COUNT').reset_index()
    print (df3)
       A  B_COUNT  C_COUNT  D_COUNT
    0  a        2        2        1
    1  b        3        2        3
    2  c        2        1        1
    

    The related function Series.value_counts returns a Series containing counts of unique values in descending order, so that the first element is the most frequently occurring. It excludes NaN values by default.

    df4 = (df['A'].value_counts()
                  .rename_axis('A')
                  .reset_index(name='COUNT'))
    print (df4)
       A  COUNT
    0  b      3
    1  a      2
    2  c      2
    

    If you want the same output as with groupby + size, add Series.sort_index:

    df5 = (df['A'].value_counts()
                  .sort_index()
                  .rename_axis('A')
                  .reset_index(name='COUNT'))
    print (df5)
       A  COUNT
    0  a      2
    1  b      3
    2  c      2
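
    As an aside not in the original answer: when counts are needed per combination of two columns in table form, pd.crosstab is a common companion to these functions:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B': ['one', 'two', 'three', 'two', 'two', 'one', 'three']})

# Count the rows for each (A, B) combination, laid out as a
# table with A as rows and B as columns.
ct = pd.crosstab(df['A'], df['B'])
print(ct)
```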
    

    Question 5

    How to create new column filled by aggregated values?

    The method GroupBy.transform returns an object that is indexed the same (same size) as the one being grouped.

    See the pandas documentation for more information.

    np.random.seed(123)
    
    df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                        'B' : ['one', 'two', 'three','two', 'two', 'one'],
                        'C' : np.random.randint(5, size=6),
                        'D' : np.random.randint(5, size=6)})
    print (df)
         A      B  C  D
    0  foo    one  2  3
    1  foo    two  4  1
    2  bar  three  2  1
    3  foo    two  1  0
    4  bar    two  3  1
    5  foo    one  2  1
    
    
    df['C1'] = df.groupby('A')['C'].transform('sum')
    df['C2'] = df.groupby(['A','B'])['C'].transform('sum')
    
    
    df[['C3','D3']] = df.groupby('A')[['C','D']].transform('sum')
    df[['C4','D4']] = df.groupby(['A','B'])[['C','D']].transform('sum')
    
    print (df)
    
         A      B  C  D  C1  C2  C3  D3  C4  D4
    0  foo    one  2  3   9   4   9   5   4   4
    1  foo    two  4  1   9   5   9   5   5   1
    2  bar  three  2  1   5   2   5   2   2   1
    3  foo    two  1  0   9   5   9   5   5   1
    4  bar    two  3  1   5   3   5   2   3   1
    5  foo    one  2  1   9   4   9   5   4   4
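
    One common use of transform, as an extra sketch not in the original answer, is dividing each value by its group total to get a within-group share:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'C': [2, 4, 2, 1, 3, 2]})

# transform('sum') broadcasts each group's total back to every
# row, so the division yields each row's share of its group.
df['C_share'] = df['C'] / df.groupby('A')['C'].transform('sum')
print(df)
```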
    