Pandas: grouping and aggregation with multiple functions

前端 未结 4 1721
南笙
南笙 2021-01-14 04:52

Situation

I have a pandas dataframe defined as follows:

import pandas as pd

headers = [\'Group\', \'Element\', \'Case\', \'Score\', \'Evaluation\'         


        
相关标签:
4条回答
  • 2021-01-14 05:06

    You can use apply instead of agg to construct all the columns in one go.

    result = (
        df.groupby('Group').apply(lambda x: [np.max(x.Score),
                                  df.loc[x.Score.idxmax(),'Element'],
                                  df.loc[x.Score.idxmax(),'Case'],
                                  np.min(x.Evaluation)])
          .apply(pd.Series)
          .rename(columns={0:'Max_score_value',
                           1:'Max_score_element',
                           2:'Max_score_case',
                           3:'Min_evaluation'})
          .reset_index()
    )
    
    
    
    result
    Out[9]: 
      Group  Max_score_value  Max_score_element Max_score_case  Min_evaluation
    0     A             9.19                  1              y            0.41
    1     B             9.12                  2              x            0.10
    
    0 讨论(0)
  • 2021-01-14 05:09

    Here is possible solution with pd.merge

    >> r = df.groupby('Group') \
    >>       .agg({'Score': 'idxmax', 'Evaluation': 'min'}) \
    >>       .rename(columns={'Score': 'idx'})
    >> for c in ['Score', 'Element', 'Case']:
    >>   r = pd.merge(r, df[[c]], how='left', left_on='idx', right_index=True)
    >> r.drop('Score_idx', axis=1).rename(columns={'Score': 'Max_score_value', 
    >>                                             'Element': 'Max_score_element', 
    >>                                             'Case': 'Max_score_case'})
           Evaluation  Max_score_value  Max_score_element Max_score_case
    Group                                                               
    A            0.41             9.19                  1              y
    B            0.10             9.12                  2              x
    

    Though it provides the desired output, I am not sure about if it's not less efficient than yours approach.

    0 讨论(0)
  • 2021-01-14 05:11

    Starting from the result data frame, you can transform in two steps as follows to the format you need:

    # collapse multi index column to single level column
    result.columns = [y + '_' + x if y != '' else x for x, y in result.columns]
    ​
    # split the idxmax column into two columns
    result = result.assign(
        max_score_element = result.idxmax_Score.str[0],
        max_score_case = result.idxmax_Score.str[1]
    ).drop('idxmax_Score', 1)
    
    result
    
    #Group  max_Score   min_Evaluation  max_score_case  max_score_element
    #0   A       9.19             0.41               y                  1
    #1   B       9.12             0.10               x                  2
    

    An alternative starting from original df using join, which may not be as efficient but less verbose similar to @tarashypka's idea:

    (df.groupby('Group')
       .agg({'Score': 'idxmax', 'Evaluation': 'min'})
       .set_index('Score')
       .join(df.drop('Evaluation',1))
       .reset_index(drop=True))
    
    #Evaluation  Group  Element   Case  Score
    #0     0.41      A        1      y   9.19
    #1     0.10      B        2      x   9.12
    

    Naive timing with the example data set:

    %%timeit 
    (df.groupby('Group')
     .agg({'Score': 'idxmax', 'Evaluation': 'min'})
     .set_index('Score')
     .join(df.drop('Evaluation',1))
     .reset_index(drop=True))
    # 100 loops, best of 3: 3.47 ms per loop
    
    %%timeit
    result = (
        df.set_index(['Element', 'Case'])
        .groupby('Group')
        .agg({'Score': ['max', 'idxmax'], 'Evaluation': 'min'})
        .reset_index()
    )
    ​
    result.columns = [y + '_' + x if y != '' else x for x, y in result.columns]
    ​
    result = result.assign(
        max_score_element = result.idxmax_Score.str[0],
        max_score_case = result.idxmax_Score.str[1]
    ).drop('idxmax_Score', 1)
    # 100 loops, best of 3: 7.61 ms per loop
    
    0 讨论(0)
  • 2021-01-14 05:23

    My Take

    g = df.set_index('Group').groupby(level='Group', group_keys=False)
    
    result = g.apply(
        pd.DataFrame.nlargest, n=1, columns='Score'
    )
    
    def f(x):
        x = 'value' if x == 'Score' else x
        return 'Max_score_' + x.lower()
    
    result.drop('Evaluation', 1).rename(columns=f).assign(
        Min_evaluation=g.Evaluation.min().values).reset_index()
    
      Group  Max_score_element Max_score_case  Max_score_value  Min_evaluation
    0     A                  1              y             9.19            0.41
    1     B                  2              x             9.12            0.10
    
    0 讨论(0)
提交回复
热议问题