Using R Parallel with other R packages

被撕碎了的回忆 2021-01-06 07:03

I am working on a very time-intensive analysis using the lqmm package in R. I started the model on Thursday; it is now Monday, and it is still running. I am confid

2 Answers
  • 2021-01-06 07:45

    It sounds like you want to use parallel computing to make a single call of the lqmm function execute more quickly. To do that, you either have to:

    • Split the one call of lqmm into multiple function calls;
    • Parallelize a loop inside lqmm.

    Some functions can be split up into multiple smaller pieces by specifying a smaller iteration value. Examples include parallelizing randomForest over the ntree argument, or parallelizing kmeans over the nstart argument. Another common case is to split the input data into smaller pieces, operate on the pieces in parallel, and then combine the results. That is often done when the input data is a data frame or a matrix.
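    As a rough sketch of the nstart case, the block below splits 20 random starts of kmeans into four calls of five starts each and keeps the fit with the lowest total within-cluster sum of squares. The toy matrix x, the two-worker cluster, and the seed are assumptions made for illustration; they have nothing to do with the lqmm problem itself.

```r
library(parallel)

# Toy data (hypothetical): two well-separated groups of points
set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))

cl <- makeCluster(2)             # two workers, chosen arbitrarily
clusterSetRNGStream(cl, 123)     # independent random streams per worker
clusterExport(cl, "x", envir = environment())

# Four calls with nstart = 5 instead of one call with nstart = 20
fits <- parLapply(cl, 1:4, function(i) kmeans(x, centers = 2, nstart = 5))
stopCluster(cl)

# Combine: keep the fit with the smallest total within-cluster sum of squares
wss <- sapply(fits, function(f) f$tot.withinss)
best <- fits[[which.min(wss)]]
```

    The combine step is what makes this split valid: kmeans with nstart already picks the best of several random starts, so picking the best of the partial results gives the same kind of answer.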

    Often, though, you have to modify a function in order to parallelize it. That can actually be easier, because you may not have to figure out how to split up the problem and combine the partial results: you may only need to convert an lapply call into a parallel lapply, or a for loop into a foreach loop. However, understanding the code is often time-consuming. It's also a good idea to profile the code first, so that you parallelize the part that actually dominates the run time.
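    A minimal sketch of the lapply conversion, using only the parallel package. The function slow_square is a hypothetical stand-in for an expensive loop body, and the two-worker cluster is an arbitrary choice.

```r
library(parallel)

# Hypothetical stand-in for an expensive loop body
slow_square <- function(v) {
  Sys.sleep(0.1)
  v^2
}

# Serial version
t_serial <- system.time(serial <- lapply(1:8, slow_square))

# Parallel version: same call shape, with the cluster as first argument
cl <- makeCluster(2)
t_parallel <- system.time(parallel_res <- parLapply(cl, 1:8, slow_square))
stopCluster(cl)

# Both versions return the same list; compare the elapsed times
same <- identical(serial, parallel_res)
print(rbind(serial = t_serial["elapsed"], parallel = t_parallel["elapsed"]))
```

    For the profiling step, base R's Rprof and summaryRprof can show where a call spends its time before you decide which loop is worth converting.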

    I suggest that you download the source distribution of the lqmm package and start reading the code. Try to understand its structure and get an idea of which loops could be executed in parallel. If you're lucky, you might figure out a way to split one call into multiple calls, but otherwise you'll have to rebuild a modified version of the package on your machine.

  • 2021-01-06 07:52

    The packages your code depends on need to be loaded on every node. The parallel package provides the function clusterEvalQ for this purpose. You might also need to export some of your data to the global environments of the worker nodes; for this you can use the clusterExport function. Also view this page for more info on other relevant functions that might be useful to you.
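    A small sketch of both calls using only the parallel package. The splines package stands in for lqmm here because it ships with R, and scale_factor is a hypothetical object standing in for your data.

```r
library(parallel)

scale_factor <- 10   # hypothetical object the workers need

cl <- makeCluster(2)

# Load a package on every worker (splines is a stand-in for lqmm)
clusterEvalQ(cl, library(splines))

# Copy scale_factor into each worker's global environment
clusterExport(cl, "scale_factor", envir = environment())

# Each task uses both the loaded package (bs) and the exported object
res <- parSapply(cl, 1:4, function(i) {
  basis <- bs(seq(0, 1, length.out = 20), df = 4)
  ncol(basis) * scale_factor * i
})
stopCluster(cl)
```

    Without the clusterEvalQ call the workers would not find bs, and without clusterExport they would not find scale_factor, because each worker is a fresh R session.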

    In general, to speed up your application by using multiple cores, you have to split your problem into multiple pieces that can be processed in parallel on different cores. To achieve this in R, you first need to create a cluster and assign a particular number of cores to it. Next, you have to register the cluster, export the required variables to the nodes, and then load the necessary libraries on each node. The exact way that you set up your cluster and launch the nodes depends on the packages and functions that you use. As an example, your cluster setup might look like this when you choose to use the doParallel package (and most of the other parallelisation packages/functions):

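    library(doParallel)                 # also attaches foreach and parallel
    nrCores <- detectCores()            # number of cores on this machine
    cl <- makeCluster(nrCores)          # start one worker per core
    registerDoParallel(cl)              # use the cluster as the %dopar% backend
    clusterExport(cl, c("g1data"), envir = environment())  # copy g1data to every worker
    clusterEvalQ(cl, library("lqmm"))   # load lqmm on every worker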
    

    The cluster is now prepared. You can now assign parts of the global task to each individual node in your cluster. In the general example below, each node in your cluster processes part i of the global task. The example uses the foreach %dopar% construct, with doParallel as the backend:

    The doParallel package provides a parallel backend for the foreach/%dopar% function using the parallel package of R 2.14.0 and later.

    Subresults will automatically be added to the resultList. Finally, when all subprocesses are finished we merge the results:

    resultList <- foreach(i = 1:nrCores) %dopar% {
      # process part i of your data; the value of the last expression is collected
    }
    stopCluster(cl)  # shut the workers down
    # merge the elements of resultList here
    

    Since your question was not specifically about how to split up your data, I will let you figure out the details of this part for yourself. However, you can find a more detailed example using the doParallel package in my answer to this post.
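    For orientation only, here is one way the skeleton could be filled in. The data frame toy, the row-wise chunking, and the per-chunk summary are all assumptions made for illustration; whether an lqmm fit can be split and recombined like this is a separate question that this sketch does not answer.

```r
library(doParallel)  # also attaches foreach and parallel

# Hypothetical data standing in for g1data
set.seed(42)
toy <- data.frame(id = rep(1:4, each = 25), y = rnorm(100))

nrCores <- 2                      # fixed here; detectCores() in the setup above
cl <- makeCluster(nrCores)
registerDoParallel(cl)

# Split the rows into one contiguous chunk per worker
chunks <- split(toy, cut(seq_len(nrow(toy)), nrCores, labels = FALSE))

# Each iteration returns a one-row summary of its chunk
resultList <- foreach(chunk = chunks) %dopar% {
  data.frame(n = nrow(chunk), mean_y = mean(chunk$y))
}
stopCluster(cl)

# Merge the partial results into a single data frame
merged <- do.call(rbind, resultList)
```

    Passing .combine = rbind to foreach would do the merge step directly and return the combined data frame instead of a list.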
