When group a pandas dataframe by one column say \"version\" and which has 10 distinct versions. How can one plot the Top 3 (which cover over 90%) and put the small remainders in
# number of top-n you want
n = 2
# group by & sort descending
df_sorted = (df
.groupby('Version').sum()
.sort_values('Value', ascending=False)
.reset_index()
)
# rename rows other than top-n to 'Others'
df_sorted.loc[df_sorted.index >= n, 'Version'] = 'Others'
# re-group by again
df_sorted.groupby('Version').sum()
I assume you also want the Other
group to be summed, for your example to a total of 3?
If i was aiming to win the Pandas one-liner competition this would be my entry:
df.replace(df.groupby('Version').sum().sort('Value', ascending=False).index[2:], 'Other').groupby('Version').sum()
Value
Version
Other 3
Top1 19
Top2 13
But that's completely unreadable, so lets break it down:
You already showed how to sum each group, sorting this result and selecting anything outside of the top 2 can be done with:
not_top2 = df.groupby('Version').sum().sort('Value', ascending=False).index[2:]
In this example not_top2
contains Other1
and Other2
.
We can replace those Versions
to a common name with:
dfnew = df.replace(not_top2, 'Other')
print dfnew
Version Value
0 Top1 14
1 Top1 3
2 Top1 2
3 Top2 6
4 Top2 7
5 Other 1
6 Other 2
The above replaces the contents of not_top2
in any column. A little substep is needed if you expect this value to occur in any other column than Version
.
Whats left is to do your original grouping again:
dfnew.groupby('Version').sum()
Which gives:
Value
Version
Other 3
Top1 19
Top2 13