Apply different functions to different items in group object: Python pandas

后端未结

关注

 3  1708

Suppose I have a dataframe as follows:

In [1]: test_dup_df

Out[1]:
                  exe_price exe_vol flag 
2008-03-13 14:41:07  84.5    200     yes
2008-03-13


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  暗喜        
                
              
                            
                2021-02-03 11:54
              
            
            
                                                                       
Apply your own function:

In [12]: def func(x):
             exe_price = (x['exe_price']*x['exe_vol']).sum() / x['exe_vol'].sum()
             exe_vol = x['exe_vol'].sum()
             flag = True        
             return Series([exe_price, exe_vol, flag], index=['exe_price', 'exe_vol', 'flag'])


In [13]: test_dup_df.groupby(test_dup_df.index).apply(func)
Out[13]:
                    exe_price exe_vol  flag
date_time                                  
2008-03-13 14:41:07      84.5     200  True 
2008-03-13 14:41:37        85   10000  True
2008-03-13 14:41:38      84.5   69700  True
2008-03-13 14:41:39      84.5    1200  True
2008-03-13 14:42:00      84.5    1000  True
2008-03-13 14:42:08      84.5     300  True
2008-03-13 14:42:10     20.71  100000  True
2008-03-13 14:42:15      84.5    5000  True
2008-03-13 14:42:16      84.5    3200  True

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  广开言路        
                
              
                            
                2021-02-03 12:03
              
            
            
                                                                       
Not terribly familiar with pandas, but in pure numpy you could do:

tot_vol = np.sum(grouped['exe_vol'])
avg_price = np.average(grouped['exe_price'], weights=grouped['exe_vol'])

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  北恋        
                
              
                            
                2021-02-03 12:10
              
            
            
                                                                       
I like @waitingkuo's answer because it is very clear and readable.

I'm keeping this around anyway because it does appear to be faster -- at least with Pandas version 0.10.0. The situation may (hopefully) change in the future, so be sure to rerun the benchmark especially if you are using a different version of Pandas.

import pandas as pd
import io
import timeit

data = '''\
date time       exe_price    exe_vol flag
2008-03-13 14:41:07  84.5    200     yes
2008-03-13 14:41:37  85.0    10000   yes
2008-03-13 14:41:38  84.5    69700   yes
2008-03-13 14:41:39  84.5    1200    yes
2008-03-13 14:42:00  84.5    1000    yes
2008-03-13 14:42:08  84.5    300     yes
2008-03-13 14:42:10  10    88100   yes
2008-03-13 14:42:10  100    11900   yes
2008-03-13 14:42:15  84.5    5000    yes
2008-03-13 14:42:16  84.5    3200    yes'''

df = pd.read_table(io.BytesIO(data), sep='\s+', parse_dates=[[0, 1]],
                   index_col=0)


def func(subf):
    exe_vol = subf['exe_vol'].sum()
    exe_price = ((subf['exe_price']*subf['exe_vol']).sum()
                 / exe_vol)
    flag = True
    return pd.Series([exe_price, exe_vol, flag],
                     index=['exe_price', 'exe_vol', 'flag'])
    # return exe_price

def using_apply():
    return df.groupby(df.index).apply(func)

def using_helper_column():
    df['weight'] = df['exe_price'] * df['exe_vol']
    grouped = df.groupby(level=0, group_keys=True)
    result = grouped.agg({'weight': 'sum', 'exe_vol': 'sum'})
    result['exe_price'] = result['weight'] / result['exe_vol']
    result['flag'] = True
    result = result.drop(['weight'], axis=1)
    return result

result = using_apply()
print(result)
result = using_helper_column()
print(result)

time_apply = timeit.timeit('m.using_apply()',
                      'import __main__ as m ',
                      number=1000)
time_helper = timeit.timeit('m.using_helper_column()',
                      'import __main__ as m ',
                      number=1000)
print('using_apply: {t}'.format(t = time_apply))
print('using_helper_column: {t}'.format(t = time_helper))


yields

                     exe_vol  exe_price  flag
date_time                                    
2008-03-13 14:41:07      200      84.50  True
2008-03-13 14:41:37    10000      85.00  True
2008-03-13 14:41:38    69700      84.50  True
2008-03-13 14:41:39     1200      84.50  True
2008-03-13 14:42:00     1000      84.50  True
2008-03-13 14:42:08      300      84.50  True
2008-03-13 14:42:10   100000      20.71  True
2008-03-13 14:42:15     5000      84.50  True
2008-03-13 14:42:16     3200      84.50  True


with timeit benchmarks of:

using_apply: 3.0081038475
using_helper_column: 1.35300707817

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复