I am trying to use groupby
, nlargest
, and sum
functions in Pandas together, but having trouble making it work.
State
You can use apply
after performing the groupby
:
df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())
I think this issue you're having is that df.groupby('State')['Population'].nlargest(2)
will return a DataFrame, so you can no longer do group level operations. In general, if you want to perform multiple operations in a group, you'll need to use apply
/agg
.
The resulting output:
State
Alabama 150
Wyoming 330
EDIT
A slightly cleaner approach, as suggested by @cᴏʟᴅsᴘᴇᴇᴅ:
df.groupby('State')['Population'].nlargest(2).sum(level=0)
This is slightly slower than using apply
on larger DataFrames though.
Using the following setup:
import numpy as np
import pandas as pd
from string import ascii_letters
n = 10**6
df = pd.DataFrame({'A': np.random.choice(list(ascii_letters), size=n),
'B': np.random.randint(10**7, size=n)})
I get the following timings:
In [3]: %timeit df.groupby('A')['B'].apply(lambda grp: grp.nlargest(2).sum())
103 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit df.groupby('A')['B'].nlargest(2).sum(level=0)
147 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The slower performance is potentially caused by the level
kwarg in sum
performing a second groupby
under the hood.
Using agg
, the grouping logic looks like:
df.groupby('State').agg({'Population': {lambda x: x.nlargest(2).sum() }})
This results in another dataframe object; which you could query to find the most populous states, etc.
Population
State
Alabama 150
Wyoming 330