How to group “remaining” results beyond Top N into “Others” with pandas

孤独总比滥情好 2021-02-11 01:05

When grouping a pandas DataFrame by one column, say "version", which has 10 distinct versions, how can one plot the Top 3 (which cover over 90%) and put the small remainders in an "Others" group?

2 answers
  •  無奈伤痛
    2021-02-11 01:42

    I assume you also want the Other group to be summed, which in your example gives a total of 3?

    If I were aiming to win the Pandas one-liner competition, this would be my entry:

    df.replace(df.groupby('Version').sum().sort_values('Value', ascending=False).index[2:], 'Other').groupby('Version').sum()
    
             Value
    Version       
    Other        3
    Top1        19
    Top2        13
    

    But that's completely unreadable, so let's break it down.

    You already showed how to sum each group. Sorting that result and selecting everything outside the top 2 can be done with:

    not_top2 = df.groupby('Version').sum().sort_values('Value', ascending=False).index[2:]
    

    In this example not_top2 contains Other1 and Other2.
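
    As a runnable sketch of this step: the question's input frame is not shown here, so the sample data below is an assumption reconstructed from the printed output in this answer (which of Other1 and Other2 holds the value 1 is a guess).

```python
import pandas as pd

# Assumed sample data, reconstructed from the output shown in this answer.
df = pd.DataFrame({
    "Version": ["Top1", "Top1", "Top1", "Top2", "Top2", "Other1", "Other2"],
    "Value": [14, 3, 2, 6, 7, 1, 2],
})

# Sum per version, largest total first.
totals = df.groupby("Version").sum().sort_values("Value", ascending=False)

# Everything after the first two rows is outside the top 2.
not_top2 = totals.index[2:]

print(totals)
print(list(not_top2))
```

    Changing the slice to index[3:] would keep the Top 3 instead, as asked in the question.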

    We can replace those Versions with a common name:

    dfnew = df.replace(not_top2, 'Other')
    print(dfnew)
    
      Version  Value
    0    Top1     14
    1    Top1      3
    2    Top1      2
    3    Top2      6
    4    Top2      7
    5   Other      1
    6   Other      2
    

    The above replaces the values in not_top2 wherever they occur, in any column. If you expect those values to appear in a column other than Version, an extra step is needed to restrict the replacement to the Version column.
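
    One possible way to do that extra step is to call replace on the Version column only. This is a minimal sketch, not necessarily what the answer had in mind; the sample data and the Note column are hypothetical, added only to show that other columns stay untouched.

```python
import pandas as pd

# Assumed sample data; "Note" is a hypothetical extra column that happens
# to contain the string "Other1" and must not be rewritten.
df = pd.DataFrame({
    "Version": ["Top1", "Top1", "Top1", "Top2", "Top2", "Other1", "Other2"],
    "Value": [14, 3, 2, 6, 7, 1, 2],
    "Note": ["a", "b", "Other1", "c", "d", "e", "f"],
})

# Versions outside the top 2 by summed Value.
totals = df.groupby("Version")["Value"].sum().sort_values(ascending=False)
not_top2 = list(totals.index[2:])

# Replace only within the Version column, leaving other columns as they are.
dfnew = df.copy()
dfnew["Version"] = dfnew["Version"].replace(not_top2, "Other")

result = dfnew.groupby("Version")["Value"].sum()
print(result)
```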

    What's left is to do your original grouping again:

    dfnew.groupby('Version').sum()
    

    Which gives:

             Value
    Version       
    Other        3
    Top1        19
    Top2        13
    
