Pandas groupby nlargest sum

终归单人心 2020-11-30 04:44

I am trying to use the groupby, nlargest, and sum functions in Pandas together, but I am having trouble making them work.

State          


        
2 Answers
  • 2020-11-30 05:39

    You can use apply after performing the groupby:

    df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())
    

    I think the issue you're having is that df.groupby('State')['Population'].nlargest(2) returns a plain Series indexed by State and the original row label, not a grouped object, so you can no longer chain group-level operations directly onto it. In general, if you want to perform multiple operations within each group, you'll need to use apply/agg.

    The resulting output:

    State
    Alabama    150
    Wyoming    330
    
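    The question's sample table did not survive, so here is a minimal, self-contained sketch of the apply approach on made-up data. The City column and all population values are assumptions, chosen only so that the sums match the output shown above.

```python
import pandas as pd

# Hypothetical sample data: the question's original table is not available.
df = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama", "Wyoming", "Wyoming", "Wyoming"],
    "City": ["A1", "A2", "A3", "W1", "W2", "W3"],
    "Population": [100, 50, 40, 180, 150, 120],
})

# For each state, keep the two largest populations and add them up.
top2_sum = df.groupby("State")["Population"].apply(
    lambda grp: grp.nlargest(2).sum()
)
print(top2_sum)
```

    The lambda receives one Series per state, so nlargest(2) and sum() both run inside a single group.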

    EDIT

    A slightly cleaner approach, as suggested by @cᴏʟᴅsᴘᴇᴇᴅ:

    df.groupby('State')['Population'].nlargest(2).sum(level=0)
    

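    This is slightly slower than using apply on larger DataFrames though. Note also that the level keyword of sum was deprecated in pandas 1.3 and removed in pandas 2.0; on newer versions, write .nlargest(2).groupby(level=0).sum() instead.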
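    To see why this second spelling works, it can help to look at the intermediate result. The sketch below uses made-up data (the question's table is not available) and writes the level-0 sum as groupby(level=0).sum(), which, as far as I know, is equivalent to sum(level=0) and also runs on pandas 2.x.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama", "Wyoming", "Wyoming", "Wyoming"],
    "Population": [100, 50, 40, 180, 150, 120],
})

# nlargest on a grouped column returns a Series whose index has two levels:
# the group key (State) and the original row label.
top2 = df.groupby("State")["Population"].nlargest(2)
print(top2)

# Regroup on the first index level (State) and sum within each group.
top2_sum = top2.groupby(level=0).sum()
print(top2_sum)
```

    The second groupby is needed because the result of nlargest is an ordinary Series, not a grouped object.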

    Using the following setup:

    import numpy as np
    import pandas as pd
    from string import ascii_letters
    
    n = 10**6
    df = pd.DataFrame({'A': np.random.choice(list(ascii_letters), size=n),
                       'B': np.random.randint(10**7, size=n)})
    

    I get the following timings:

    In [3]: %timeit df.groupby('A')['B'].apply(lambda grp: grp.nlargest(2).sum())
    103 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    In [4]: %timeit df.groupby('A')['B'].nlargest(2).sum(level=0)
    147 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    The slower performance is potentially caused by the level kwarg in sum performing a second groupby under the hood.
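    For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard-library timeit module. The seed, the smaller n, the helper names, and the use of groupby(level=0).sum() in place of sum(level=0) are all my assumptions, so absolute numbers will differ from the timings above.

```python
import timeit
from string import ascii_letters

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
n = 10**5  # smaller than the 10**6 used above, to keep the run short
df = pd.DataFrame({"A": rng.choice(list(ascii_letters), size=n),
                   "B": rng.integers(10**7, size=n)})


def with_apply():
    # One Python-level call per group.
    return df.groupby("A")["B"].apply(lambda grp: grp.nlargest(2).sum())


def with_regroup():
    # Grouped nlargest, then a second groupby on the first index level.
    return df.groupby("A")["B"].nlargest(2).groupby(level=0).sum()


result_apply = with_apply()
result_regroup = with_regroup()

# Time a few repetitions of each variant; results depend on machine and version.
t_apply = timeit.timeit(with_apply, number=5)
t_regroup = timeit.timeit(with_regroup, number=5)
print(f"apply:   {t_apply / 5 * 1000:.1f} ms per call")
print(f"regroup: {t_regroup / 5 * 1000:.1f} ms per call")
```

    Both functions should return the same per-group sums, so the comparison is only about speed.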

  • 2020-11-30 05:42

    Using agg, the grouping logic looks like:

    df.groupby('State').agg({'Population': lambda x: x.nlargest(2).sum()})

    This results in another DataFrame, which you could query to find the most populous states, etc.

               Population
    State
    Alabama    150
    Wyoming    330
    
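    As a sketch of how that follow-up query might look, the example below uses named aggregation (available since pandas 0.25) on made-up data. The population values, the helper top2_sum, and the column name top2_population are hypothetical and not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama", "Wyoming", "Wyoming", "Wyoming"],
    "Population": [100, 50, 40, 180, 150, 120],
})


def top2_sum(s):
    """Sum of the two largest values in a Series."""
    return s.nlargest(2).sum()


# Named aggregation: output column name = (input column, function).
result = df.groupby("State").agg(top2_population=("Population", top2_sum))
print(result)

# Query the aggregated frame, e.g. the state with the largest top-2 total.
most_populous = result["top2_population"].idxmax()
print(most_populous)
```

    Named aggregation keeps the output column name explicit, which is convenient when several aggregates are computed at once.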