Slice the original df after a groupby().nlargest(x) operation

前端未结


Given the problems with groupby() and nlargest() described here and here, I am trying to work around them.

Note: for simplicity I us


1 answer

温柔的废话

2021-01-26 12:28

Unless I'm missing something (and I agree there are bugs lurking in the pandas code here), we can bypass any difficulties relatively simply.

Method #1: use loc and idxmax:

In [21]: df.loc[df.groupby(cols2)["p234_r_c"].idxmax()]
Out[21]: 
     city1     city2  p234_r_c plant1_type plant2_type
6   Austin    Dallas       3.0        COAL        NUKE
3  Chicago     Miami       0.5        COAL    COMBCYCL
0  Chicago   Toronto       5.0    COMBCYCL        COAL
2  Chicago  St.Louis       2.0        NUKE    COMBCYCL
5  Houston    Dallas       4.0    COMBCYCL        NUKE
4    Miami    Dallas       1.0        NUKE        COAL

In [22]: df.loc[df.groupby(cols)["p234_r_c"].idxmax()]
Out[22]: 
     city1     city2  p234_r_c plant1_type plant2_type
6   Austin    Dallas       3.0        COAL        NUKE
5  Houston    Dallas       4.0    COMBCYCL        NUKE
4    Miami    Dallas       1.0        NUKE        COAL
1  Chicago   Detroit       4.0    COMBCYCL        COAL
3  Chicago     Miami       0.5        COAL    COMBCYCL
2  Chicago  St.Louis       2.0        NUKE    COMBCYCL
0  Chicago   Toronto       5.0    COMBCYCL        COAL
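
For anyone who wants to run Method #1 without the question's setup, here is a minimal sketch. The frame is reconstructed from the outputs printed above, and the definitions of cols and cols2 are inferred from the column order in those outputs, so treat both as assumptions rather than the asker's exact code.

```python
import pandas as pd

# Assumption: this frame is rebuilt from the Out[...] tables in this answer,
# not copied from the original question.
df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})

# Assumption: grouping keys inferred from the printed results.
cols = ["city2", "plant1_type", "plant2_type"]
cols2 = ["city1", "plant1_type", "plant2_type"]

# idxmax returns, for each group, the index label of the row with the largest value.
best_idx = df.groupby(cols2)["p234_r_c"].idxmax()

# Those labels slice the original frame directly, keeping every column.
top_rows = df.loc[best_idx]
print(top_rows)
```

Because idxmax hands back labels from the original index, the result is a genuine slice of df rather than an aggregated copy.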

Method #2: sort by p234_r_c and use last:

In [17]: df.sort_values("p234_r_c").groupby(cols2, as_index=False).last()
Out[17]: 
     city1 plant1_type plant2_type     city2  p234_r_c
0   Austin        COAL        NUKE    Dallas       3.0
1  Chicago        COAL    COMBCYCL     Miami       0.5
2  Chicago    COMBCYCL        COAL   Toronto       5.0
3  Chicago        NUKE    COMBCYCL  St.Louis       2.0
4  Houston    COMBCYCL        NUKE    Dallas       4.0
5    Miami        NUKE        COAL    Dallas       1.0

In [18]: df.sort_values("p234_r_c").groupby(cols, as_index=False).last()
Out[18]: 
      city2 plant1_type plant2_type    city1  p234_r_c
0    Dallas        COAL        NUKE   Austin       3.0
1    Dallas    COMBCYCL        NUKE  Houston       4.0
2    Dallas        NUKE        COAL    Miami       1.0
3   Detroit    COMBCYCL        COAL  Chicago       4.0
4     Miami        COAL    COMBCYCL  Chicago       0.5
5  St.Louis        NUKE    COMBCYCL  Chicago       2.0
6   Toronto    COMBCYCL        COAL  Chicago       5.0
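
Note that last() returns a freshly numbered frame, so the original index labels are gone. If the goal is to slice the original df, a possible variation is to swap last() for tail(1), which keeps the original labels. The sketch below uses the same reconstructed frame and inferred cols2 as an assumption, since the question's setup is not shown here.

```python
import pandas as pd

# Assumption: frame rebuilt from the outputs printed in this answer.
df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})
cols2 = ["city1", "plant1_type", "plant2_type"]  # assumption: inferred keys

ordered = df.sort_values("p234_r_c")

# last() aggregates: one row per group, with a new 0..n-1 index.
by_last = ordered.groupby(cols2, as_index=False).last()

# tail(1) filters: the last row of each group, original index labels intact.
by_tail = ordered.groupby(cols2).tail(1)

# The surviving labels can be used to slice the original frame.
original_rows = df.loc[by_tail.index]
print(by_last.index.tolist())
print(by_tail.index.tolist())
```

One more thing worth checking in your own data: as far as I know, last() skips missing values column by column, so with NaNs present it may combine values from different rows of a group, whereas tail(1) always returns whole rows.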

If you also want to get more than one row per group while nlargest and nsmallest are broken, I think it's simplest to sort and then use head or tail. For example:

In [27]: df.sort_values("p234_r_c").groupby(cols, as_index=False).tail(2)
Out[27]: 
     city1     city2  p234_r_c plant1_type plant2_type
3  Chicago     Miami       0.5        COAL    COMBCYCL
4    Miami    Dallas       1.0        NUKE        COAL
2  Chicago  St.Louis       2.0        NUKE    COMBCYCL
6   Austin    Dallas       3.0        COAL        NUKE
1  Chicago   Detroit       4.0    COMBCYCL        COAL
5  Houston    Dallas       4.0    COMBCYCL        NUKE
0  Chicago   Toronto       5.0    COMBCYCL        COAL
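
With cols every group holds a single row, so tail(2) above returns all seven rows. To see the cut-off take effect, here is a sketch that wraps the same sort-then-head idea in a small helper. The name top_n_per_group is hypothetical (it is not a pandas function), and the frame and cols2 are reconstructed from the outputs in this answer, so both are assumptions.

```python
import pandas as pd

# Assumption: frame rebuilt from the outputs printed in this answer.
df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})
cols2 = ["city1", "plant1_type", "plant2_type"]  # assumption: inferred keys


def top_n_per_group(frame, keys, value, n):
    """Hypothetical helper: the n rows with the largest `value` in each group."""
    # Sort descending so the largest values come first, then keep the
    # first n rows of every group. Original index labels are preserved.
    return frame.sort_values(value, ascending=False).groupby(keys).head(n)


top1 = top_n_per_group(df, cols2, "p234_r_c", 1)
top2 = top_n_per_group(df, cols2, "p234_r_c", 2)
print(top1)
print(top2)
```

Only the (Chicago, COMBCYCL, COAL) group has two rows here, so top1 drops row 1 while top2 keeps it. Sorting ascending and using tail(n) gives the same rows in the opposite order.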
