I want to apply multiple functions to multiple columns of a groupby object, resulting in a new pandas.DataFrame.
I know how to do it in separate steps.
Another solid variation is to do what @MaxU did in his solution to a similar question and wrap the individual functions in a pandas Series, so that only a reset_index() is needed to return a DataFrame.
First, define the functions for transformations:
def ed(group):
    return (group.elapsed_time * group.num_cores).sum() / 86400

def rd(group):
    return (group.running_time * group.num_cores).sum() / 86400
Wrap them up in a Series using get_stats:
def get_stats(group):
    return pd.Series({'elapsed_days': ed(group),
                      'running_days': rd(group)})
Finally:
lasts.groupby('user').apply(get_stats).reset_index()
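For the lasts data defined in the next snippet (the same data as the other answers), the result should look roughly like:
  user  elapsed_days  running_days
0    a      3.240741      2.430556
1    d     10.416667      3.819444
2    s      4.629630      1.851852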
In response to the bounty, we can make it more general by using partial application via functools.partial from the standard library.
import functools
import pandas as pd

# same data as the other answer:
lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})

# define the desired lambda as a function (.loc replaces the deprecated .ix):
def myfunc(column, df, cores):
    return (column * df.loc[column.index, cores]).sum() / 86400

# use partial to bind the dataframe and the cores column:
mynewfunc = functools.partial(myfunc, df=lasts, cores='num_cores')

# agg by the partial function
lasts.groupby('user').agg({'elapsed_time': mynewfunc, 'running_time': mynewfunc})
Which gives us:
running_time elapsed_time
user
a 2.430556 3.240741
d 3.819444 10.416667
s 1.851852 4.629630
This is overkill for the given example, but it may be useful as a more general pattern.
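As an aside (my addition, not from the original answer), the same column-aware aggregator can also be built with a plain closure instead of functools.partial; a minimal sketch, reusing the lasts frame above, with make_core_days being an illustrative name:
def make_core_days(df, cores='num_cores'):
    # returns an aggregator that weights whatever column agg hands it by df[cores]
    def core_days(column):
        return (column * df.loc[column.index, cores]).sum() / 86400
    return core_days

core_days = make_core_days(lasts)
lasts.groupby('user').agg({'elapsed_time': core_days, 'running_time': core_days})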
The agg function might be what you're looking for. I added an example dataset and applied the operation to a copy of lasts, which I named lasts_.
import pandas as pd

lasts = pd.DataFrame({'user'        : ['james', 'james', 'james', 'john', 'john'],
                      'elapsed_time': [ 200000,  400000,  300000, 800000, 900000],
                      'running_time': [ 100000,  100000,  200000, 600000, 700000],
                      'num_cores'   : [      4,       4,       4,      8,      8]})

# create a temporary df to add columns to, without modifying the original dataframe;
# to_frame turns the 'user' Series into a DataFrame so more columns can be added below
lasts_ = lasts.loc[:, 'user'].to_frame()
lasts_['elapsed_days'] = lasts.loc[:, 'elapsed_time'] * lasts.loc[:, 'num_cores'] / 86400
lasts_['running_days'] = lasts.loc[:, 'running_time'] * lasts.loc[:, 'num_cores'] / 86400
# aggregate
by_user = lasts_.groupby('user').agg({'elapsed_days': 'sum',
'running_days': 'sum' })
# by_user:
# user elapsed_days running_days
# james 41.66666666666667 18.51851851851852
# john 157.4074074074074 120.37037037037037
If you want to keep 'user' as a normal column instead of the index, use:
by_user = lasts_.groupby('user', as_index=False).agg({'elapsed_days': 'sum',
                                                      'running_days': 'sum'})
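A shorter variant of the same idea (my sketch, not from the original answer) uses assign(), so the temporary frame does not have to be built column by column:
by_user = (lasts
           .assign(elapsed_days=lasts['elapsed_time'] * lasts['num_cores'] / 86400,
                   running_days=lasts['running_time'] * lasts['num_cores'] / 86400)
           .groupby('user', as_index=False)[['elapsed_days', 'running_days']]
           .sum())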
To use the agg method on a groupby object while pulling in data from other columns of the same dataframe, you can do the following: define your functions (lambda functions or not) so that they take a Series as input, and fetch the data from the other column(s) with the df.loc[series.index, col] syntax. With this example:
ed = lambda x: (x * lasts.loc[x.index, "num_cores"]).sum() / 86400.
rd = lambda x: (x * lasts.loc[x.index, "num_cores"]).sum() / 86400.
where lasts is the main DataFrame, and we access the data in the num_cores column through the .loc indexer.
Create a dictionary with these functions and the names for the newly created columns. The keys are the names of the columns on which to apply each function, and each value is another dictionary whose key is the new column name and whose value is the function.
my_func = {"elapsed_time": {"elapsed_day": ed},
           "running_time": {"running_days": rd}}
Groupby and aggregate:
user_df = lasts.groupby("user").agg(my_func)
user_df
elapsed_time running_time
elapsed_day running_days
user
a 3.240741 2.430556
d 10.416667 3.819444
s 4.629630 1.851852
If you want to remove the old column names:
user_df.columns = user_df.columns.droplevel(0)
user_df
elapsed_day running_days
user
a 3.240741 2.430556
d 10.416667 3.819444
s 4.629630 1.851852
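Note that this dict-of-dicts renaming inside agg was deprecated in pandas 0.20 and removed in pandas 1.0. On newer pandas versions, a sketch of the same result using named aggregation (reusing the ed and rd lambdas above) would be:
user_df = lasts.groupby("user").agg(elapsed_day=("elapsed_time", ed),
                                    running_days=("running_time", rd))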
HTH
I think you can avoid agg or apply altogether: first multiply with mul, then divide with div, and finally groupby by the index and aggregate with sum:
lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})
print (lasts)
elapsed_time num_cores running_time user
0 40000 7 30000 a
1 50000 8 20000 s
2 60000 9 30000 d
3 90000 4 15000 d
For comparison, here is the apply-based version first:
by_user = lasts.groupby('user')
elapsed_days = by_user.apply(lambda x: (x.elapsed_time * x.num_cores).sum() / 86400)
print (elapsed_days)
running_days = by_user.apply(lambda x: (x.running_time * x.num_cores).sum() / 86400)
user_df = elapsed_days.to_frame('elapsed_days').join(running_days.to_frame('running_days'))
print (user_df)
elapsed_days running_days
user
a 3.240741 2.430556
d 10.416667 3.819444
s 4.629630 1.851852
And the version without apply, using mul, div and groupby by index:
lasts = lasts.set_index('user')
print (lasts[['elapsed_time','running_time']].mul(lasts['num_cores'], axis=0)
.div(86400)
.groupby(level=0)
.sum())
elapsed_time running_time
user
a 3.240741 2.430556
d 10.416667 3.819444
s 4.629630 1.851852
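As a variation (my sketch, not from the original answer), if lasts still has 'user' as a regular column, i.e. before the set_index call above, the same computation works by passing that column directly as the groupby key:
print (lasts[['elapsed_time','running_time']].mul(lasts['num_cores'], axis=0)
           .div(86400)
           .groupby(lasts['user'])
           .sum())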
Here is a solution which closely resembles the original idea expressed under "I suspect there is a better way".
I'll use the same testing data as the other answers:
lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})
groupby.apply can accept a function which returns a dataframe, and it will then automatically stitch the returned dataframes together. There are two small catches worth noting. The first is that the values passed to DataFrame are in fact single-element lists rather than plain numbers.
def aggfunc(group):
    """This function mirrors the OP's idea. Note the values below are lists."""
    return pd.DataFrame({'elapsed_days': [(group.elapsed_time * group.num_cores).sum() / 86400],
                         'running_days': [(group.running_time * group.num_cores).sum() / 86400]})
user_df = lasts.groupby('user').apply(aggfunc)
Result:
elapsed_days running_days
user
a 0 3.240741 2.430556
d 0 10.416667 3.819444
s 0 4.629630 1.851852
The second is that the returned dataframe has a hierarchical index (that column of zeros), which can be flattened as shown below:
user_df.index = user_df.index.levels[0]
Result:
elapsed_days running_days
user
a 3.240741 2.430556
d 10.416667 3.819444
s 4.629630 1.851852
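An equivalent way to drop that inner all-zeros level here (my note, not part of the original answer) is droplevel:
user_df.index = user_df.index.droplevel(1)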