slice original df after groupby().nlargest(x) operation

闹比i 2021-01-26 11:44

Given the problems with groupby() and nlargest() as described here and here, I am trying to work around them.

Note: for simplicity I us

1 answer
  •  温柔的废话
    2021-01-26 12:28

    Unless I'm missing something (and I agree there are bugs lurking in the pandas code here), we can bypass any difficulties relatively simply.

    Method #1: use loc and idxmax:

    In [21]: df.loc[df.groupby(cols2)["p234_r_c"].idxmax()]
    Out[21]: 
         city1     city2  p234_r_c plant1_type plant2_type
    6   Austin    Dallas       3.0        COAL        NUKE
    3  Chicago     Miami       0.5        COAL    COMBCYCL
    0  Chicago   Toronto       5.0    COMBCYCL        COAL
    2  Chicago  St.Louis       2.0        NUKE    COMBCYCL
    5  Houston    Dallas       4.0    COMBCYCL        NUKE
    4    Miami    Dallas       1.0        NUKE        COAL
    
    In [22]: df.loc[df.groupby(cols)["p234_r_c"].idxmax()]
    Out[22]: 
         city1     city2  p234_r_c plant1_type plant2_type
    6   Austin    Dallas       3.0        COAL        NUKE
    5  Houston    Dallas       4.0    COMBCYCL        NUKE
    4    Miami    Dallas       1.0        NUKE        COAL
    1  Chicago   Detroit       4.0    COMBCYCL        COAL
    3  Chicago     Miami       0.5        COAL    COMBCYCL
    2  Chicago  St.Louis       2.0        NUKE    COMBCYCL
    0  Chicago   Toronto       5.0    COMBCYCL        COAL
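
The question's setup code is not shown above, so here is a minimal self-contained sketch of Method #1. The frame `df` and the key lists `cols` and `cols2` are assumptions, reconstructed from the rows and column order printed in the outputs; the asker's real data may differ.

```python
import pandas as pd

# Assumed data: rebuilt from the seven rows printed in Out[22].
df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})

# Assumed grouping keys, inferred from the column order in the Method #2 outputs.
cols = ["city2", "plant1_type", "plant2_type"]
cols2 = ["city1", "plant1_type", "plant2_type"]

# idxmax returns, per group, the index label of the row with the largest value.
best_labels = df.groupby(cols2)["p234_r_c"].idxmax()

# Those labels belong to the original frame, so loc slices the original rows.
best_rows = df.loc[best_labels]
print(best_rows)
```

Because idxmax hands back labels of the original frame, the result keeps every original column and the original index, which is what the question title asks for.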
    

    Method #2: sort by p234_r_c and use last:

    In [17]: df.sort_values("p234_r_c").groupby(cols2, as_index=False).last()
    Out[17]: 
         city1 plant1_type plant2_type     city2  p234_r_c
    0   Austin        COAL        NUKE    Dallas       3.0
    1  Chicago        COAL    COMBCYCL     Miami       0.5
    2  Chicago    COMBCYCL        COAL   Toronto       5.0
    3  Chicago        NUKE    COMBCYCL  St.Louis       2.0
    4  Houston    COMBCYCL        NUKE    Dallas       4.0
    5    Miami        NUKE        COAL    Dallas       1.0
    
    In [18]: df.sort_values("p234_r_c").groupby(cols, as_index=False).last()
    Out[18]: 
          city2 plant1_type plant2_type    city1  p234_r_c
    0    Dallas        COAL        NUKE   Austin       3.0
    1    Dallas    COMBCYCL        NUKE  Houston       4.0
    2    Dallas        NUKE        COAL    Miami       1.0
    3   Detroit    COMBCYCL        COAL  Chicago       4.0
    4     Miami        COAL    COMBCYCL  Chicago       0.5
    5  St.Louis        NUKE    COMBCYCL  Chicago       2.0
    6   Toronto    COMBCYCL        COAL  Chicago       5.0
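
A runnable sketch of Method #2 follows, again under the assumption that `df` and `cols2` look like what the printed outputs suggest. Note that `last()` builds new rows with a fresh 0..n-1 index, so unlike Method #1 the original index labels are not preserved.

```python
import pandas as pd

# Assumed data: rebuilt from the rows printed in the outputs above.
df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})
cols2 = ["city1", "plant1_type", "plant2_type"]  # assumed grouping keys

# Sort ascending so the largest value is the last row within each group.
ordered = df.sort_values("p234_r_c")

# as_index=False keeps the grouping keys as ordinary columns in the result.
best = ordered.groupby(cols2, as_index=False).last()
print(best)
```

One caveat worth keeping in mind: `last()` works column by column and skips missing values, so on data containing NaN it can combine values from different rows. The sample data here has no missing values, so each result row corresponds to one original row.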
    

    If you also want to get several rows per group while nlargest and nsmallest are broken, I think it's simplest to sort and then use head or tail. For example:

    In [27]: df.sort_values("p234_r_c").groupby(cols, as_index=False).tail(2)
    Out[27]: 
         city1     city2  p234_r_c plant1_type plant2_type
    3  Chicago     Miami       0.5        COAL    COMBCYCL
    4    Miami    Dallas       1.0        NUKE        COAL
    2  Chicago  St.Louis       2.0        NUKE    COMBCYCL
    6   Austin    Dallas       3.0        COAL        NUKE
    1  Chicago   Detroit       4.0    COMBCYCL        COAL
    5  Houston    Dallas       4.0    COMBCYCL        NUKE
    0  Chicago   Toronto       5.0    COMBCYCL        COAL
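
The sort-then-tail idea can be wrapped in a small helper. This is only a sketch: `top_n_per_group` is a hypothetical name, not a pandas function, and `df` and `cols2` are assumptions reconstructed from the printed outputs. A stable sort is used so that tied values (rows 1 and 5 both have 4.0) keep their original relative order.

```python
import pandas as pd

# Assumed data: rebuilt from the rows printed in the outputs above.
df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})
cols2 = ["city1", "plant1_type", "plant2_type"]  # assumed grouping keys


def top_n_per_group(frame, keys, value, n):
    """Hypothetical helper: the n rows with the largest `value` in each group."""
    # mergesort is stable, so ties stay in their original order.
    ordered = frame.sort_values(value, kind="mergesort")
    # tail keeps the original index labels, unlike last().
    return ordered.groupby(keys).tail(n)


top1 = top_n_per_group(df, cols2, "p234_r_c", 1)
top2 = top_n_per_group(df, cols2, "p234_r_c", 2)

# The labels in top1 point back into df, so the original frame can be sliced directly.
sliced = df.loc[top1.index.sort_values()]
print(sliced)
```

With these keys only the (Chicago, COMBCYCL, COAL) group has two rows, so `top1` drops row 1 while `top2` returns all seven rows.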
    
