Question:

I understand that passing a function as a group key calls the function once per index value with the return values being used as the group names. What I can't figure out is how to call the function on column values.

So I can do this:

import numpy as np
from pandas import DataFrame

people = DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

def GroupFunc(x):
    if len(x) > 3:
        return 'Group1'
    else:
        return 'Group2'

people.groupby(GroupFunc).sum()

This splits the data into two groups: one whose index values have length three or less, and the other whose index values are longer than three. But how can I pass one of the column values? For example, grouping on whether the value in column a at each index point is greater than 1. I realise I could just do the following:

people.groupby(people.a > 1).sum()

But I want to know how to do this in a user-defined function for future reference.

Something like:

def GroupColFunc(x):
    if x > 1:
        return 'Group1'
    else:
        return 'Group2'

But how do I call this? I tried

people.groupby(GroupColFunc(people.a))

and similar variants but this does not work.

How do I pass the column values to the function? How would I pass multiple column values e.g. to group on whether people.a > people.b for example?

Answer 1:

To group by a > 1, you can define your function like:

>>> def GroupColFunc(df, ind, col):
...     if df[col].loc[ind] > 1:
...         return 'Group1'
...     else:
...         return 'Group2'
...

And then call it like this:

>>> people.groupby(lambda x: GroupColFunc(people, x, 'a')).sum()
               a         b         c         d        e
Group2 -2.384614 -0.762208  3.359299 -1.574938 -2.65963

Or you can do it with just an anonymous function:

>>> people.groupby(lambda x: 'Group1' if people['b'].loc[x] > people['a'].loc[x] else 'Group2').sum()
               a         b         c         d         e
Group1 -3.280319 -0.007196  1.525356  0.324154 -1.002439
Group2  0.895705 -0.755012  1.833943 -1.899092 -1.657191
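Because the session above uses random data, the groups differ on every run. Here is a minimal, reproducible sketch of the same idea (the function receives an index label and looks up the column values itself). The fixed numbers and the helper name `group_by_row` are made up for illustration, and it assumes the index labels are unique so that `df.loc[label]` returns a single row.

```python
import pandas as pd

# Hypothetical fixed data so the grouping is reproducible.
people = pd.DataFrame(
    {"a": [1.5, -0.2, 0.3, 2.0, -1.0], "b": [0.5, 0.1, 1.0, 3.0, -2.0]},
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)

def group_by_row(df, label):
    # groupby hands the function one index label at a time;
    # use it to fetch the row and compare column values.
    row = df.loc[label]
    return "Group1" if row["a"] > row["b"] else "Group2"

# The lambda binds the DataFrame so groupby only has to supply the label.
result = people.groupby(lambda label: group_by_row(people, label)).sum()
print(result)
```

The lookup runs once per index label in Python, so this is convenient for small frames but will likely be slow on large ones.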

As the documentation says, you can also group by passing a Series that provides a label -> group name mapping:

>>> mapping = np.where(people['b'] > people['a'], 'Group1', 'Group2')
>>> mapping
Joe       Group2
Steve     Group1
Wes       Group2
Jim       Group1
Travis    Group1
dtype: string48
>>> people.groupby(mapping).sum()
               a         b         c         d         e
Group1 -3.280319 -0.007196  1.525356  0.324154 -1.002439
Group2  0.895705 -0.755012  1.833943 -1.899092 -1.657191
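The Series-mapping idea also gives a fairly direct way to reuse a scalar user-defined function like the one in the question. The sketch below is one possible approach, not the only one. It assumes a recent pandas, where `np.where` returns a plain NumPy array rather than a Series, so the array is wrapped in `pd.Series` with the frame's index to keep the label -> group name mapping explicit. The data values and the names `group_col_func`, `keys_a`, and `keys_ab` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data so the result is reproducible.
people = pd.DataFrame(
    {"a": [1.5, -0.2, 0.3, 2.0, -1.0], "b": [0.5, 0.1, 1.0, 3.0, -2.0]},
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)

def group_col_func(x):
    # Scalar function: receives one column value at a time.
    return "Group1" if x > 1 else "Group2"

# One column: map the function over the column to build the key Series.
# Renamed so the key does not share a name with column 'a'.
keys_a = people["a"].map(group_col_func).rename("group")
by_a = people.groupby(keys_a).sum()

# Several columns: apply a row-wise function (axis=1) to build the keys.
keys_ab = people.apply(
    lambda row: "Group1" if row["b"] > row["a"] else "Group2", axis=1
)
by_ab = people.groupby(keys_ab).sum()

# Vectorized equivalent of the row-wise version, wrapped in a Series
# that carries the same index as the DataFrame.
mapping = pd.Series(
    np.where(people["b"] > people["a"], "Group1", "Group2"),
    index=people.index,
)
by_where = people.groupby(mapping).sum()

print(by_a)
print(by_ab)
print(by_where)
```

The `np.where` form avoids a Python-level call per row, so it is probably the better choice when the condition can be written as a vectorized expression. `map` and `apply` are handy when the logic only exists as an ordinary Python function.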


Source: Groupby with User Defined Functions Pandas
