Does multicore computing using R's doParallel package use more memory?

生来不讨喜 2021-02-10 17:18

I just tested an elastic net with and without a parallel backend. The call is:

enetGrid <- data.frame(.lambda=0,.fraction=c(.005))
ctrl <- trainControl( m         


        
2 answers
  • 2021-02-10 18:13

    There is a minimum character count; otherwise I would simply have typed: 1) Yes. 2) No, er, maybe. There are packages that use a "shared memory" model for parallel computation, but R's more thoroughly tested packages don't use it.

    http://www.stat.berkeley.edu/scf/paciorek-parallelWorkshop.pdf

    http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf

    http://heather.cs.ucdavis.edu/Rdsm/BARUGSlides.pdf

  • 2021-02-10 18:23

    In multithreaded programs, threads share lots of memory. It's primarily the stack that isn't shared between threads. But, to quote Dirk Eddelbuettel, "R is, and will remain, single-threaded", so R parallel packages use processes rather than threads, and so there is much less opportunity to share memory.

    However, memory is shared between the processes that are forked by mclapply, as long as the processes don't modify it. A write makes the operating system copy the affected memory region for that process (copy-on-write). That is one reason that the memory footprint can be smaller when using the "multicore" API versus the "snow" API with parallel/doParallel.
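
    As a rough illustration of the forking behaviour described above, here is a minimal sketch using mclapply directly. The object name big, its size, and the worker count are arbitrary choices for the example, not anything from the question. Each worker only reads big, so on Linux or Mac OS X no copy of it should be needed.

    ```r
    library(parallel)

    # A read-only object created in the parent process (size chosen arbitrarily).
    big <- runif(1e6)

    # Forking is unavailable on Windows, where mclapply only accepts mc.cores = 1.
    n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

    # Each task reads a strided slice of `big`. The forked workers see the
    # parent's memory directly; `big` is not serialized and sent to them.
    res <- mclapply(1:4, function(i) {
      idx <- seq(i, length(big), by = 4)
      sum(big[idx])
    }, mc.cores = n_cores)

    total <- sum(unlist(res))
    ```

    If the function assigned into big, the operating system would copy the touched pages for that worker, and the memory saving would shrink accordingly.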

    In other words, using:

    library(doParallel)
    registerDoParallel(7)  # 7 workers, forked on Linux and Mac OS X
    

    may be much more memory efficient than using:

    library(doParallel)
    cl <- makeCluster(7)    # 7 separate R worker processes
    registerDoParallel(cl)
    # call stopCluster(cl) when finished
    

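    since the former will cause %dopar% to use mclapply on Linux and Mac OS X, while the latter uses clusterApplyLB. (On Windows, where forking is not available, the former also ends up using a "snow"-style cluster.)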
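
    To make the comparison concrete, here is a small sketch that runs the same foreach loop under both registrations. The data x, the loop body, and the worker count of 2 are made up for illustration. The results should be the same either way; what differs is how x reaches the workers.

    ```r
    library(doParallel)  # also attaches foreach and parallel

    x <- rnorm(1e5)  # data defined in the master process

    # "multicore" style: on Linux and Mac OS X the workers are forked and can
    # read `x` from the parent's memory.
    registerDoParallel(cores = 2)
    r_fork <- foreach(i = 1:4, .combine = c) %dopar% { mean(x) + i }
    stopImplicitCluster()  # cleans up the implicit cluster created on Windows

    # "snow" style: separate R processes, so `x` is serialized and sent to
    # each worker, which then holds its own copy.
    cl <- makeCluster(2)
    registerDoParallel(cl)
    r_snow <- foreach(i = 1:4, .combine = c) %dopar% { mean(x) + i }
    stopCluster(cl)
    ```

    With a large x and many workers, the second form keeps one copy per worker, which is where the extra memory goes.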

    However, the "snow" API allows you to use multiple machines, and that means that the total memory available to your program grows with the number of machines. This is a great advantage since it can allow programs to scale. Some programs even get super-linear speedup when running in parallel on a cluster since they have access to more memory.

    So to answer your second question, I'd say to use the "multicore" API with doParallel if you only have a single machine and are using Linux or Mac OS X, but use the "snow" API with multiple machines if you're using a cluster. I don't think there is any way to use shared memory packages such as Rdsm with the caret package.
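
    That rule of thumb could be wrapped up roughly as follows. Note that register_backend is a hypothetical helper written for this answer, not a doParallel function, and the hosts argument assumes a character vector of machine names that you can already reach from R.

    ```r
    library(doParallel)

    # Hypothetical helper (not part of doParallel): pick a backend based on
    # whether several machines are given and on the operating system.
    register_backend <- function(workers = 2, hosts = NULL) {
      if (!is.null(hosts)) {
        # Several machines: use a "snow"-style cluster.
        cl <- makeCluster(hosts)
        registerDoParallel(cl)
        return(cl)
      }
      if (.Platform$OS.type == "unix") {
        # Single Linux or Mac OS X machine: forked workers.
        registerDoParallel(cores = workers)
        return(NULL)
      }
      # Single Windows machine: forking is unavailable, so use a local cluster.
      cl <- makeCluster(workers)
      registerDoParallel(cl)
      cl
    }

    cl <- register_backend(workers = 2)
    res <- foreach(i = 1:3, .combine = c) %dopar% { i^2 }
    if (!is.null(cl)) stopCluster(cl)
    ```

    Since caret runs its resampling through foreach, registering the backend before calling train should be all that is needed on the caret side.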
