I understand that passing a function as a group key calls the function once per index value with the return values being used as the group names. What I can't figure out is how to call the function on column values.
So I can do this:
import numpy as np
from pandas import DataFrame

people = DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

def GroupFunc(x):
    # x is a single index label, e.g. 'Steve'
    if len(x) > 3:
        return 'Group1'
    else:
        return 'Group2'

people.groupby(GroupFunc).sum()
This splits the data into two groups: one whose index values have length 3 or less, and one whose index values are longer than 3. But how can I pass one of the column values instead, for example to group on whether the value in column a is greater than 1 for each row? I realise I could just do the following:
people.groupby(people.a > 1).sum()
But I want to know how to do this in a user defined function for future reference.
Something like:
def GroupColFunc(x):
    if x > 1:
        return 'Group1'
    else:
        return 'Group2'
But how do I call this? I tried
people.groupby(GroupColFunc(people.a))
and similar variants but this does not work.
How do I pass the column values to the function? And how would I pass multiple column values, e.g. to group on whether people.a > people.b?
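For reference, here is a minimal sketch of one way this might be done, not necessarily the idiomatic answer. It assumes groupby accepts a Series of labels aligned on the index as the group key, so the function is applied to the column first and the result is passed to groupby. GroupRowFunc, keys_single, keys_multi and the fixed seed are hypothetical additions for illustration only.

```python
import numpy as np
import pandas as pd

# Fixed seed so the sketch is reproducible (an assumption, not in the question).
rng = np.random.default_rng(0)
people = pd.DataFrame(rng.standard_normal((5, 5)),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

def GroupColFunc(x):
    # Receives one scalar from the column at a time, not the whole Series.
    return 'Group1' if x > 1 else 'Group2'

# Apply the function element-wise to column a. The resulting Series of labels
# shares the frame's index, so it can be used directly as the group key.
keys_single = people['a'].map(GroupColFunc)
by_a = people.groupby(keys_single).sum()

def GroupRowFunc(row):
    # Hypothetical helper: receives one row (as a Series) at a time,
    # so several column values are available at once.
    return 'Group1' if row['a'] > row['b'] else 'Group2'

# axis=1 calls the function once per row rather than once per column.
keys_multi = people.apply(GroupRowFunc, axis=1)
by_a_vs_b = people.groupby(keys_multi).sum()
```

The attempt people.groupby(GroupColFunc(people.a)) passes the whole column to the function in one call, so "if x > 1" is evaluated on a Series rather than on a scalar. Mapping the function over the column first avoids that.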