I have a problem where I need to take groups of rows from a data frame where the number of items in a group exceeds a certain number (cutoff). For those groups, I need to take s
Use groupby/filter:
>>> df.groupby('id').filter(lambda x: len(x) > cutoff)
This returns only the rows of your dataframe that belong to groups whose size is greater than your cutoff. It should also perform quite a bit better. I timed filter here on a dataframe with 30,039 'id' groups and a little over 4 million observations:
In [9]: %timeit df.groupby('id').filter(lambda x: len(x) > 12)
1 loops, best of 3: 12.6 s per loop
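For a small, self-contained illustration of the groupby/filter call, here is a minimal sketch. The toy dataframe, the `value` column, and the cutoff of 2 are made up for the example and are not from the question. The last lines show a boolean-mask variant built with `transform('size')`, which should select the same rows; it is included only as a cross-check, not as the approach timed above.

```python
import pandas as pd

# Hypothetical toy data: group 'a' has 3 rows, 'b' has 1, 'c' has 2.
df = pd.DataFrame({
    "id": ["a", "a", "a", "b", "c", "c"],
    "value": [1, 2, 3, 4, 5, 6],
})
cutoff = 2  # assumed cutoff for this example

# Keep only rows whose group has more than `cutoff` members.
large_groups = df.groupby("id").filter(lambda g: len(g) > cutoff)
print(large_groups)

# Cross-check: broadcast each group's size back to its rows and mask.
mask = df.groupby("id")["id"].transform("size") > cutoff
same_rows = df[mask]
print(large_groups.equals(same_rows))
```

Only the rows for id 'a' survive here, because the comparison is strict: group 'c' has exactly 2 rows, which is not greater than the cutoff.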