Does multicore computing using R's doParallel package use more memory?

抹茶落季  ·  2021-02-10 17:51

I just tested an elastic net with and without a parallel backend. The call is:

enetGrid <- data.frame(.lambda=0,.fraction=c(.005))
ctrl <- trainControl( method="repeatedcv", repeats=5 )  # call truncated in the original post; these arguments are a plausible reconstruction

2 Answers
  •  伪装坚强ぢ
    2021-02-10 18:07

    In multithreaded programs, threads share most of their memory; it's primarily the stack that isn't shared between threads. But, to quote Dirk Eddelbuettel, "R is, and will remain, single-threaded", so R's parallel packages use processes rather than threads, and there is therefore much less opportunity to share memory.

    However, memory is shared between the processes forked by mclapply, as long as the processes don't modify it: a write triggers a copy of the affected memory pages in the operating system (copy-on-write). That is one reason the memory footprint can be smaller when using the "multicore" API rather than the "snow" API with parallel/doParallel.

    In other words, using:

    library(doParallel)
    registerDoParallel(7)   # fork-based ("multicore") backend on Linux/macOS

    may be much more memory efficient than using:

    library(doParallel)
    cl <- makeCluster(7)    # socket-based ("snow") cluster of 7 separate R worker processes
    registerDoParallel(cl)

    since the former will cause %dopar% to use mclapply on Linux and Mac OS X, while the latter uses clusterApplyLB.
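
    To make this concrete, here is a minimal sketch of a %dopar% loop using the fork-based registration; the data and loop body are placeholders, not code from the question:

    library(doParallel)
    library(foreach)

    registerDoParallel(7)                       # fork-based backend on Linux/macOS

    bigData <- matrix(rnorm(1e6), ncol = 100)   # placeholder data, created before the workers fork

    # The forked workers see bigData through the shared (copy-on-write) address
    # space, so no serialized copy needs to be sent to them.
    results <- foreach(i = 1:100, .combine = c) %dopar% {
      sum(bigData[, i])                         # read-only access, so no pages are copied
    }

    stopImplicitCluster()                       # clean up; effectively a no-op with the fork backend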

    However, the "snow" API allows you to use multiple machines, which means the total memory available to your program grows with the number of nodes. This is a great advantage because it allows programs to scale; some programs even achieve super-linear speedup when run in parallel on a cluster because they gain access to more aggregate memory.
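
    As a sketch of the multi-machine case (the hostnames below are made up, and the nodes are assumed to be reachable, e.g. over ssh), you pass the node names to makeCluster and every node contributes its own RAM:

    library(doParallel)

    hosts <- c("node1", "node1", "node2", "node2")   # hypothetical hostnames, one worker per entry
    cl <- makeCluster(hosts)                         # workers are separate R processes reached over sockets
    registerDoParallel(cl)

    # ... %dopar% loops run here, load-balanced across the machines ...

    stopCluster(cl)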

    So to answer your second question, I'd say to use the "multicore" API with doParallel if you only have a single machine and are using Linux or Mac OS X, but use the "snow" API with multiple machines if you're using a cluster. I don't think there is any way to use shared memory packages such as Rdsm with the caret package.
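
    To connect this back to caret: once a backend is registered, train picks it up automatically (allowParallel in trainControl defaults to TRUE). A hedged sketch for a single Linux or Mac OS X machine, with placeholder data rather than the question's, might look like this:

    library(caret)          # method = "enet" also requires the elasticnet package
    library(doParallel)

    registerDoParallel(7)   # "multicore" backend, as recommended above

    set.seed(1)
    x <- matrix(rnorm(100 * 20), ncol = 20)   # placeholder predictors
    y <- rnorm(100)                           # placeholder response

    # Current caret uses undotted tuning-parameter names ("lambda", "fraction");
    # the question's grid used the older dotted convention.
    enetGrid <- data.frame(lambda = 0, fraction = 0.005)
    ctrl     <- trainControl(method = "repeatedcv", repeats = 5, allowParallel = TRUE)

    fit <- train(x, y, method = "enet", tuneGrid = enetGrid, trControl = ctrl)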
