How to group “remaining” results beyond Top N into “Others” with pandas

后端 未结 2 1349
孤独总比滥情好
孤独总比滥情好 2021-02-11 01:05

When group a pandas dataframe by one column say \"version\" and which has 10 distinct versions. How can one plot the Top 3 (which cover over 90%) and put the small remainders in

相关标签:
2条回答
  • 2021-02-11 01:38
    # number of top-n you want
    n = 2
    
    # group by & sort descending
    df_sorted = (df
                    .groupby('Version').sum()
                    .sort_values('Value', ascending=False)
                    .reset_index()
                )
    
    # rename rows other than top-n to 'Others'
    df_sorted.loc[df_sorted.index >= n, 'Version'] = 'Others'
    
    # re-group by again
    df_sorted.groupby('Version').sum()
    
    0 讨论(0)
  • 2021-02-11 01:42

    I assume you also want the Other group to be summed, for your example to a total of 3?

    If i was aiming to win the Pandas one-liner competition this would be my entry:

    df.replace(df.groupby('Version').sum().sort('Value', ascending=False).index[2:], 'Other').groupby('Version').sum()
    
             Value
    Version       
    Other        3
    Top1        19
    Top2        13
    

    But that's completely unreadable, so lets break it down:

    You already showed how to sum each group, sorting this result and selecting anything outside of the top 2 can be done with:

    not_top2 = df.groupby('Version').sum().sort('Value', ascending=False).index[2:]
    

    In this example not_top2 contains Other1 and Other2.

    We can replace those Versions to a common name with:

    dfnew  = df.replace(not_top2, 'Other')
    print dfnew
    
      Version  Value
    0    Top1     14
    1    Top1      3
    2    Top1      2
    3    Top2      6
    4    Top2      7
    5   Other      1
    6   Other      2
    

    The above replaces the contents of not_top2 in any column. A little substep is needed if you expect this value to occur in any other column than Version.

    Whats left is to do your original grouping again:

    dfnew.groupby('Version').sum()
    

    Which gives:

             Value
    Version       
    Other        3
    Top1        19
    Top2        13
    
    0 讨论(0)
提交回复
热议问题