Pandas get topmost n records within each group

无人共我 2020-11-22 06:07

Suppose I have a pandas DataFrame like this:

>>> df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
>>> df
   id  value
0   1      1
1   1      2
2   1      3
3   2      1
4   2      2
5   2      3
6   2      4
7   3      1
8   4      1

3 Answers
  • 2020-11-22 06:46

    Did you try df.groupby('id').head(2)?

    Output generated:

    >>> df.groupby('id').head(2)
           id  value
    id             
    1  0   1      1
       1   1      2 
    2  3   2      1
       4   2      2
    3  7   3      1
    4  8   4      1
    

    (Keep in mind that you might need to sort the data beforehand, depending on how it is ordered.)

    EDIT: As mentioned by the questioner, use df.groupby('id').head(2).reset_index(drop=True) to remove the MultiIndex and flatten the results.

    >>> df.groupby('id').head(2).reset_index(drop=True)
        id  value
    0   1      1
    1   1      2
    2   2      1
    3   2      2
    4   3      1
    5   4      1
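
To make the sorting caveat concrete, here is a minimal sketch that assumes "topmost" means the largest `value` per `id` (the question does not spell this out, so that is an assumption). It sorts first, then takes the head of each group. Note that in recent pandas versions, as far as I can tell, `groupby(...).head(n)` keeps the original row index rather than producing a MultiIndex, so the final `reset_index` is only cosmetic there.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Put the largest values first, then keep the first 2 rows of every group.
# The second sort only restores a readable id order in the result.
top2 = (df.sort_values('value', ascending=False)
          .groupby('id')
          .head(2)
          .sort_values(['id', 'value'], ascending=[True, False])
          .reset_index(drop=True))

print(top2)
```

Without the initial `sort_values`, `head(2)` simply returns the first two rows of each group in their existing order, which matches the output above only because the sample data happens to be sorted ascending.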
    
  • 2020-11-22 06:47

    Sorting the whole DataFrame up front can be very time-consuming. Instead, we can group first and take the top k rows within each group:

    topk = 2  # example value: number of rows to keep per group
    g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
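
As a rough sketch of the same per-group `nlargest` idea, the version below uses an explicit loop over the groups instead of `apply`. This is a swapped-in formulation, not the code above: the way `groupby.apply` treats the grouping column has changed between pandas versions, and the loop sidesteps that. The value `topk = 2` and the sample data are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

topk = 2  # assumed example value: rows to keep per group

# Take the topk largest 'value' rows from each group, without sorting
# the whole frame, then stitch the pieces back together.
parts = [group.nlargest(topk, 'value') for _, group in df.groupby('id')]
result = pd.concat(parts).reset_index(drop=True)

print(result)
```

Each `group` here is a full sub-frame, so the `id` column is still present in the result.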
    
  • 2020-11-22 06:49

    Since pandas 0.14.1, you can call nlargest and nsmallest on a groupby object:

    In [23]: df.groupby('id')['value'].nlargest(2)
    Out[23]: 
    id   
    1   2    3
        1    2
    2   6    4
        5    3
    3   7    1
    4   8    1
    dtype: int64
    

    There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.

    If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.

    (Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
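
The index handling described above can be sketched as follows, using the question's sample data. The variable names `top2`, `flat`, and `table` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Series indexed by (id, original row label).
top2 = df.groupby('id')['value'].nlargest(2)

# Drop the inner level (the original row labels), keeping only 'id'.
flat = top2.reset_index(level=1, drop=True)

# Optionally turn 'id' back into a regular column.
table = flat.reset_index()

print(table)
```

If the original row labels carry meaning (timestamps, for example), skipping the `drop=True` step keeps them available.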
