How to speed up Pandas multilevel dataframe sum?

前端未结

关注

 2  594

I am trying to speed up the sum for several big multilevel dataframes.

Here is a sample:

df1 = mul_df(5000,30,400) # mul_df to create a big multileve


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  轮回少年        
                
              
                            
                2021-01-07 12:51
              
            
            
                                                                       
Other Observations


You cannot expect more speedup if you have only 2 cores on your machine. In the end numexpression relies on parallelisation and the performant usage of the cpu cache.
What you do is to some extend wrong. Numexpressions on DataFrames are fast, but wrong. They do not return the right result, if the DataFrames are not equally indexed. Different sorting will already trouble you, which I show below. 
If you add DataFrames with different indexes, the whole stuff is not that performant anymore. Well, Pandas does quite a good job to add the proper rows for you by looking up the corrsponding index entries. This comes with the natural cost. 


In the following my Observations:
- First, I reproduce your test case and come to other results. Using numexpression under the hood of Pandas increases performance significantly.
- Second, I sort one of the four DataFrames in descending order and rerun all cases. The performance breaks, and additionally, (as expected) numexpression evaluation on Pandas DataFrames leads to wrong results.

Equal Indices on all Frames

This case reproduces your case. The only difference is, that I create copies of the inital DataFrame instance. So there is nothing shared. There are different objects (ids) in use to make sure, that numexpression can deal with it.

import itertools
import numpy as np
import pandas as pd

def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
    ''' create multilevel dataframe, for example: mul_df(4,2,6)'''

    index_name = ['STK_ID','RPT_Date']
    col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]

    first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
    first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
    second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum

    dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt

    rst = dt.set_index(index_name, drop=True, inplace=False)
    return rst


df1 = mul_df(5000,30,400)
df2, df3, df4 = df1.copy(), df1.copy(), df1.copy() 


pd.options.compute.use_numexpr = False


%%timeit
df1 + df2 + df3 + df4


564 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


pd.options.compute.use_numexpr = True


%%timeit 
df1 + df2 + df3 + df4


152 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


import numexpr as ne


%%timeit
pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32')


66.4 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


(df1 + df2 + df3 + df4).equals(pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32'))


True


(Slightly) Different Indices on some Frames

Here I sort one of the DataFrames in descending order, therefore changing the index and reshuffling the rows in the dataframe internal numpy array.

import itertools
import numpy as np
import pandas as pd

def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
    ''' create multilevel dataframe, for example: mul_df(4,2,6)'''

    index_name = ['STK_ID','RPT_Date']
    col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]

    first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
    first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
    second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum

    dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt

    rst = dt.set_index(index_name, drop=True, inplace=False)
    return rst


df1 = mul_df(5000,30,400)
df2, df3, df4 = df1.copy(), df1.copy(), df1.copy().sort_index(ascending=False)


pd.options.compute.use_numexpr = False


%%timeit
df1 + df2 + df3 + df4


1.36 s ± 67.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


pd.options.compute.use_numexpr = True


%%timeit 
df1 + df2 + df3 + df4


928 ms ± 39.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


import numexpr as ne


%%timeit
pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32')


68 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


(df1 + df2 + df3 + df4).equals(pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32'))


False


Conclusions

By using numexpr 


Quite some speedup is gained when operating on equally indexed DataFrames.
The same is true if you have other expressions with a single dataframe, as 2 * df1.
This is not the case if operations between DataFrames with different indices are used.
It leads even to completely wrong results if one evaluates expressions containing Pandas DataFrames. By chance they can be right. But numexpression is made for optimizing expressions on Numpy arrays. 

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  盖世英雄少女心        
                
              
                            
                2021-01-07 13:03
              
            
            
                                                                       
method 1: On my machine not so bad (with numexpr disabled)

In [41]: from pandas.core import expressions as expr

In [42]: expr.set_use_numexpr(False)

In [43]: %timeit df1+df2+df3+df4
1 loops, best of 3: 349 ms per loop


method 2: Using numexpr (which is by default enabled if numexpr is installed)

In [44]: expr.set_use_numexpr(True)

In [45]: %timeit df1+df2+df3+df4
10 loops, best of 3: 173 ms per loop


method 3: Direct use of numexpr

In [34]: import numexpr as ne

In [46]: %timeit  DataFrame(ne.evaluate('df1+df2+df3+df4'),columns=df1.columns,index=df1.index,dtype='float32')
10 loops, best of 3: 47.7 ms per loop


These speedups are achieved using numexpr because:


avoids using intermediate temporary arrays (which in the case you are presenting is probably
 quite inefficient in numpy, I suspect this is being evaluated like ((df1+df2)+df3)+df4
uses multi-cores as available


As I hinted above, pandas uses numexpr under the hood for certain types of ops (in 0.11), e.g. df1 + df2 would be evaluated this way, however the example you are giving here will result in several calls to numexpr (this is method 2 is faster than method 1.). Using the direct (method 3) ne.evaluate(...) achieves even more speedups.

Note that in pandas 0.13 (0.12 will be released this week), we are implemented a function pd.eval which will in effect do exactly what my example above does. Stay tuned (if you are adventurous this will be in master somewhat soon: https://github.com/pydata/pandas/pull/4037)

In [5]: %timeit pd.eval('df1+df2+df3+df4')
10 loops, best of 3: 50.9 ms per loop


Lastly to answer your question, cython will not help here at all; numexpr is quite efficient at this type of problem (that said, there are situation where cython is helpful)

One caveat: in order to use the direct Numexpr method the frames should be already aligned (Numexpr operates on the numpy array and doesn't know anything about the indices). also they should be a single dtype
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复