pandas groupby will by default sort. But I\'d like to change the sort order. How can I do this?
I\'m guessing that I can\'t apply a sort method to the returned gro
Other instance of preserving the order or sort by descending:
In [97]: import pandas as pd
In [98]: df = pd.DataFrame({'name':['A','B','C','A','B','C','A','B','C'],'Year':[2003,2002,2001,2003,2002,2001,2003,2002,2001]})
#### Default groupby operation:
In [99]: for each in df.groupby(["Year"]): print each
(2001, Year name
2 2001 C
5 2001 C
8 2001 C)
(2002, Year name
1 2002 B
4 2002 B
7 2002 B)
(2003, Year name
0 2003 A
3 2003 A
6 2003 A)
### order preserved:
In [100]: for each in df.groupby(["Year"], sort=False): print each
(2003, Year name
0 2003 A
3 2003 A
6 2003 A)
(2002, Year name
1 2002 B
4 2002 B
7 2002 B)
(2001, Year name
2 2001 C
5 2001 C
8 2001 C)
In [106]: df.groupby(["Year"], sort=False).apply(lambda x: x.sort_values(["Year"]))
Out[106]:
Year name
Year
2003 0 2003 A
3 2003 A
6 2003 A
2002 1 2002 B
4 2002 B
7 2002 B
2001 2 2001 C
5 2001 C
8 2001 C
In [107]: df.groupby(["Year"], sort=False).apply(lambda x: x.sort_values(["Year"])).reset_index(drop=True)
Out[107]:
Year name
0 2003 A
1 2003 A
2 2003 A
3 2002 B
4 2002 B
5 2002 B
6 2001 C
7 2001 C
8 2001 C
As of Pandas 0.18 one way to do this is to use the sort_index
method of the grouped data.
Here's an example:
np.random.seed(1)
n=10
df = pd.DataFrame({'mygroups' : np.random.choice(['dogs','cats','cows','chickens'], size=n),
'data' : np.random.randint(1000, size=n)})
grouped = df.groupby('mygroups', sort=False).sum()
grouped.sort_index(ascending=False)
print grouped
data
mygroups
dogs 1831
chickens 1446
cats 933
As you can see, the groupby column is sorted descending now, indstead of the default which is ascending.
Do your groupby, and use reset_index() to make it back into a DataFrame. Then sort.
grouped = df.groupby('mygroups').sum().reset_index()
grouped.sort_values('mygroups', ascending=False)
Similar to one of the answers above, but try adding .sort_values()
to your .groupby()
will allow you to change the sort order. If you need to sort on a single column, it would look like this:
df.groupby('group')['id'].count().sort_values(ascending=False)
ascending=False
will sort from high to low, the default is to sort from low to high.
*Careful with some of these aggregations. For example .size() and .count() return different values since .size() counts NaNs.
What is the difference between size and count in pandas?
You can do a sort_values()
on the dataframe before you do the groupby. Pandas preserves the ordering in the groupby.
In [44]: d.head(10)
Out[44]:
name transcript exon
0 ENST00000456328 2 1
1 ENST00000450305 2 1
2 ENST00000450305 2 2
3 ENST00000450305 2 3
4 ENST00000456328 2 2
5 ENST00000450305 2 4
6 ENST00000450305 2 5
7 ENST00000456328 2 3
8 ENST00000450305 2 6
9 ENST00000488147 1 11
for _, a in d.head(10).sort_values(["transcript", "exon"]).groupby(["name", "transcript"]): print(a)
name transcript exon
1 ENST00000450305 2 1
2 ENST00000450305 2 2
3 ENST00000450305 2 3
5 ENST00000450305 2 4
6 ENST00000450305 2 5
8 ENST00000450305 2 6
name transcript exon
0 ENST00000456328 2 1
4 ENST00000456328 2 2
7 ENST00000456328 2 3
name transcript exon
9 ENST00000488147 1 11
This kind of operation is covered under hierarchical indexing. Check out the examples here
When you groupby, you're making new indices. If you also pass a list through .agg(). you'll get multiple columns. I was trying to figure this out and found this thread via google.
It turns out if you pass a tuple corresponding to the exact column you want sorted on.
Try this:
# generate toy data
ex = pd.DataFrame(np.random.randint(1,10,size=(100,3)), columns=['features', 'AUC', 'recall'])
# pass a tuple corresponding to which specific col you want sorted. In this case, 'mean' or 'AUC' alone are not unique.
ex.groupby('features').agg(['mean','std']).sort_values(('AUC', 'mean'))
This will output a df sorted by the AUC-mean column only.