Apply different functions to different items in group object: Python pandas

后端 未结 3 1704
暗喜
暗喜 2021-02-03 11:46

Suppose I have a dataframe as follows:

In [1]: test_dup_df

Out[1]:
                  exe_price exe_vol flag 
2008-03-13 14:41:07  84.5    200     yes
2008-03-13         


        
相关标签:
3条回答
  • 2021-02-03 11:54

    Apply your own function:

    In [12]: def func(x):
                 exe_price = (x['exe_price']*x['exe_vol']).sum() / x['exe_vol'].sum()
                 exe_vol = x['exe_vol'].sum()
                 flag = True        
                 return Series([exe_price, exe_vol, flag], index=['exe_price', 'exe_vol', 'flag'])
    
    
    In [13]: test_dup_df.groupby(test_dup_df.index).apply(func)
    Out[13]:
                        exe_price exe_vol  flag
    date_time                                  
    2008-03-13 14:41:07      84.5     200  True 
    2008-03-13 14:41:37        85   10000  True
    2008-03-13 14:41:38      84.5   69700  True
    2008-03-13 14:41:39      84.5    1200  True
    2008-03-13 14:42:00      84.5    1000  True
    2008-03-13 14:42:08      84.5     300  True
    2008-03-13 14:42:10     20.71  100000  True
    2008-03-13 14:42:15      84.5    5000  True
    2008-03-13 14:42:16      84.5    3200  True
    
    0 讨论(0)
  • 2021-02-03 12:03

    Not terribly familiar with pandas, but in pure numpy you could do:

    tot_vol = np.sum(grouped['exe_vol'])
    avg_price = np.average(grouped['exe_price'], weights=grouped['exe_vol'])
    
    0 讨论(0)
  • 2021-02-03 12:10

    I like @waitingkuo's answer because it is very clear and readable.

    I'm keeping this around anyway because it does appear to be faster -- at least with Pandas version 0.10.0. The situation may (hopefully) change in the future, so be sure to rerun the benchmark especially if you are using a different version of Pandas.

    import pandas as pd
    import io
    import timeit
    
    data = '''\
    date time       exe_price    exe_vol flag
    2008-03-13 14:41:07  84.5    200     yes
    2008-03-13 14:41:37  85.0    10000   yes
    2008-03-13 14:41:38  84.5    69700   yes
    2008-03-13 14:41:39  84.5    1200    yes
    2008-03-13 14:42:00  84.5    1000    yes
    2008-03-13 14:42:08  84.5    300     yes
    2008-03-13 14:42:10  10    88100   yes
    2008-03-13 14:42:10  100    11900   yes
    2008-03-13 14:42:15  84.5    5000    yes
    2008-03-13 14:42:16  84.5    3200    yes'''
    
    df = pd.read_table(io.BytesIO(data), sep='\s+', parse_dates=[[0, 1]],
                       index_col=0)
    
    
    def func(subf):
        exe_vol = subf['exe_vol'].sum()
        exe_price = ((subf['exe_price']*subf['exe_vol']).sum()
                     / exe_vol)
        flag = True
        return pd.Series([exe_price, exe_vol, flag],
                         index=['exe_price', 'exe_vol', 'flag'])
        # return exe_price
    
    def using_apply():
        return df.groupby(df.index).apply(func)
    
    def using_helper_column():
        df['weight'] = df['exe_price'] * df['exe_vol']
        grouped = df.groupby(level=0, group_keys=True)
        result = grouped.agg({'weight': 'sum', 'exe_vol': 'sum'})
        result['exe_price'] = result['weight'] / result['exe_vol']
        result['flag'] = True
        result = result.drop(['weight'], axis=1)
        return result
    
    result = using_apply()
    print(result)
    result = using_helper_column()
    print(result)
    
    time_apply = timeit.timeit('m.using_apply()',
                          'import __main__ as m ',
                          number=1000)
    time_helper = timeit.timeit('m.using_helper_column()',
                          'import __main__ as m ',
                          number=1000)
    print('using_apply: {t}'.format(t = time_apply))
    print('using_helper_column: {t}'.format(t = time_helper))
    

    yields

                         exe_vol  exe_price  flag
    date_time                                    
    2008-03-13 14:41:07      200      84.50  True
    2008-03-13 14:41:37    10000      85.00  True
    2008-03-13 14:41:38    69700      84.50  True
    2008-03-13 14:41:39     1200      84.50  True
    2008-03-13 14:42:00     1000      84.50  True
    2008-03-13 14:42:08      300      84.50  True
    2008-03-13 14:42:10   100000      20.71  True
    2008-03-13 14:42:15     5000      84.50  True
    2008-03-13 14:42:16     3200      84.50  True
    

    with timeit benchmarks of:

    using_apply: 3.0081038475
    using_helper_column: 1.35300707817
    
    0 讨论(0)
提交回复
热议问题