Large Pandas Dataframe parallel processing

I am accessing a very large Pandas dataframe as a global variable. This variable is accessed in parallel via joblib.

E.g.

df = db.query("select id, a_lo


                      
2 Answers

无人共我, 2021-02-07 21:20

Python multiprocessing typically uses separate processes, as you noted, which means the workers don't share memory. There is a potential workaround if you can get things to work with np.memmap, as described a little farther down the joblib docs, though dumping the data to disk will add some overhead of its own: https://pythonhosted.org/joblib/parallel.html#working-with-numerical-data-in-shared-memory-memmaping
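
A rough sketch of the memmap idea, assuming the numeric part of the DataFrame can be pulled out as a NumPy array first (for example with df[cols].to_numpy()). The names chunk_sum and parallel_sum are made up for illustration, and the per-chunk sum is only a stand-in for the real work:

```python
import os
import shutil
import tempfile

import numpy as np
from joblib import Parallel, delayed, dump, load


def chunk_sum(arr, start, stop):
    # Each worker touches only its own slice of the memory-mapped array.
    return float(arr[start:stop].sum())


def parallel_sum(values, n_chunks=4, n_jobs=2):
    # Hypothetical helper: dump the array once, reopen it as a read-only
    # memmap, and hand that memmap to the workers instead of the raw data.
    tmp_dir = tempfile.mkdtemp()
    try:
        path = os.path.join(tmp_dir, "values.joblib")
        dump(np.ascontiguousarray(values), path)
        shared = load(path, mmap_mode="r")
        bounds = np.linspace(0, len(shared), n_chunks + 1, dtype=int)
        partials = Parallel(n_jobs=n_jobs)(
            delayed(chunk_sum)(shared, int(lo), int(hi))
            for lo, hi in zip(bounds[:-1], bounds[1:])
        )
        return sum(partials)
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

The workers open the same file on disk, so the array is not copied into each process. This only helps for data that fits in a plain NumPy array; object columns such as strings would need separate handling.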
                                                                        
                                                        
            
            
              
                
礼貌的吻别, 2021-02-07 21:21

The entire DataFrame needs to be pickled and unpickled for each process created by joblib. In practice this is very slow, and because each process ends up holding its own copy, it also requires many times the memory of the DataFrame itself.
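
To get a feel for that cost, here is a small sketch that measures how many bytes a DataFrame takes once pickled. The helpers pickled_size and make_frame and the columns id and value are invented for the illustration and are not from the question:

```python
import pickle

import numpy as np
import pandas as pd


def pickled_size(obj):
    # Number of bytes that would have to be shipped to a worker process.
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def make_frame(n_rows):
    # Hypothetical stand-in for the large DataFrame in the question.
    rng = np.random.default_rng(0)
    return pd.DataFrame({"id": np.arange(n_rows), "value": rng.random(n_rows)})


if __name__ == "__main__":
    df = make_frame(1_000_000)
    print("whole frame:", pickled_size(df), "bytes")
    print("first 1000 rows:", pickled_size(df.iloc[:1000]), "bytes")
```

Comparing the two numbers shows why sending the whole frame to every worker gets expensive, and why sending each worker only the rows it needs is usually cheaper.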

One solution is to store your data in HDF (df.to_hdf) using the table format. You can then use select to read subsets of the data for further processing. In practice this will be too slow for interactive use. It is also quite complex, since your workers will need to store their results so that they can be consolidated in a final step.
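
A minimal sketch of the HDF route, assuming PyTables is installed and that the rows can be partitioned on an id column. The column names id and value, the key "data", and the helper names are all assumptions made for the example:

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_table(df, path, key="data"):
    # format="table" makes the store queryable; data_columns lists the
    # columns that may be used in a `where` filter later on.
    df.to_hdf(path, key=key, mode="w", format="table", data_columns=["id"])


def load_id_range(path, lo, hi, key="data"):
    # Read only the rows with lo <= id < hi from disk.
    with pd.HDFStore(path, mode="r") as store:
        return store.select(key, where=f"id >= {lo} & id < {hi}")


def process_in_ranges(path, edges):
    # Stand-in for the workers: each range is loaded and reduced on its own,
    # and the partial results are consolidated at the end.
    partials = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        subset = load_id_range(path, lo, hi)
        partials.append(float(subset["value"].sum()))
    return sum(partials)


def demo():
    df = pd.DataFrame({"id": np.arange(100), "value": np.arange(100) * 2.0})
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "frame.h5")
        write_table(df, path)
        return process_in_ranges(path, [0, 50, 100])
```

The loop in process_in_ranges runs serially here to keep the sketch short. In a parallel version each worker would call load_id_range itself, so that only the file path and the range bounds are passed between processes.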

An alternative would be to explore numba.vectorize with target='parallel'. This would require working with NumPy arrays rather than Pandas objects, so it also has some complexity costs.

In the long run, dask is hoped to bring parallel execution to Pandas, but this is not something to expect soon.
                                                                        
                                                        
            
            
              
                