Parallel computation of multiple imputation by using mice R package

后端 未结 1 613
清酒与你
清酒与你 2020-12-23 08:25

I want to run 150 multiple imputations by using mice in R. However, in order to save some computing time, I would lie to subdivide the process in p

相关标签:
1条回答
  • 2020-12-23 09:02

    So the main problem is combining the imputations and as I see it there are three options, using ibind, complete as described or trying to keep the mids structure. I strongly suggest the ibind solution. The others are left in the answer for those curious.

    Get parallel results

    Before doing anything we need to get the parallel mice imputations. The parallel part is rather simple, all we need to do is to use the parallel package and make sure that we set the seed using the clusterSetRNGStream:

    library(parallel)
    # Using all cores can slow down the computer
    # significantly, I therefore try to leave one
    # core alone in order to be able to do something 
    # else during the time the code runs
    cores_2_use <- detectCores() - 1
    
    cl <- makeCluster(cores_2_use)
    clusterSetRNGStream(cl, 9956)
    clusterExport(cl, "nhanes")
    clusterEvalQ(cl, library(mice))
    imp_pars <- 
      parLapply(cl = cl, X = 1:cores_2_use, fun = function(no){
        mice(nhanes, m = 30, printFlag = FALSE)
      })
    stopCluster(cl)
    

    The above will yield cores_2_use * 30 imputed datasets.

    Using ibind

    As @AleksanderBlekh suggested, the mice::ibind is probably the best, most straightforward solution:

    imp_merged <- imp_pars[[1]]
    for (n in 2:length(imp_pars)){
      imp_merged <- 
        ibind(imp_merged,
              imp_pars[[n]])
    }
    

    Using foreach with ibind

    The perhaps the simplest alternative is to use foreach:

    library(foreach)
    library(doParallel)
    cl <- makeCluster(cores_2_use)
    clusterSetRNGStream(cl, 9956)
    registerDoParallel(cl)
    
    library(mice)
    imp_merged <-
      foreach(no = 1:cores_2_use, 
              .combine = ibind, 
              .export = "nhanes",
              .packages = "mice") %dopar%
    {
      mice(nhanes, m = 30, printFlag = FALSE)
    }
    stopCluster(cl)
    

    Using complete

    Extracting the full datasets using complete(..., action="long"), rbind-ing these and then using as.mids other mice objects may work well but it generates a slimmer object than what the other two approaches:

    merged_df <- nhanes
    merged_df <- 
      cbind(data.frame(.imp = 0,
                       .id = 1:nrow(nhanes)),
            merged_df)
    for (n in 1:length(imp_pars)){
      tmp <- complete(imp_pars[[n]], action = "long")
      tmp$.imp <- as.numeric(tmp$.imp) + max(merged_df$.imp)
      merged_df <- 
        rbind(merged_df,
              tmp)
    }
    
    imp_merged <- 
      as.mids(merged_df)
    
    # Compare the most important the est and se for easier comparison
    cbind(summary(pool(with(data=imp_merged,
                            exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
          summary(pool(with(data=mice(nhanes, 
                                      m = 60, 
                                      printFlag = FALSE),
                            exp=lm(bmi~age+hyp+chl))))[,c("est", "se")])
    

    Gives the output:

                        est         se         est         se
    (Intercept) 20.41921496 3.85943925 20.33952967 3.79002725
    age         -3.56928102 1.35801557 -3.65568620 1.27603817
    hyp          1.63952970 2.05618895  1.60216683 2.17650536
    chl          0.05396451 0.02278867  0.05525561 0.02087995
    

    Keeping a correct mids-object

    My alternative approach below shows how to merge imputation objects and retain the full functionality behind the mids object. Since the ibind solution I've left this in for anyone interested in exploring how to merge complex lists.

    I've looked into mice's mids-object and there are a few step that you have to take in order to get at least a similar mids-object after running in parallel. If we examine the mids-object and compare two objects with two different setups we get:

    library(mice)
    imp <- list()
    imp <- c(imp,
             list(mice(nhanes, m = 40)))
    imp <- c(imp,
             list(mice(nhanes, m = 20)))
    
    sapply(names(imp[[1]]),
           function(n)
             try(all(useful::compare.list(imp[[1]][[n]], 
                                          imp[[2]][[n]]))))
    

    Where you can see that the call, m, imp, chainMean, and chainVar differ between the two runs. Out of these the imp is without doubt the most important but it seems like a wise option to update the other components as well. We will therefore start by building a mice merger function:

    mergeMice <- function (imp) {
      merged_imp <- NULL
      for (n in 1:length(imp)){
        if (is.null(merged_imp)){
          merged_imp <- imp[[n]]
        }else{
          counter <- merged_imp$m
          # Update counter
          merged_imp$m <- 
            merged_imp$m + imp[[n]]$m
          # Rename chains
          dimnames(imp[[n]]$chainMean)[[3]] <-
            sprintf("Chain %d", (counter + 1):merged_imp$m)
          dimnames(imp[[n]]$chainVar)[[3]] <-
            sprintf("Chain %d", (counter + 1):merged_imp$m)
          # Merge chains
          merged_imp$chainMean <- 
            abind::abind(merged_imp$chainMean, 
                         imp[[n]]$chainMean)
          merged_imp$chainVar <- 
            abind::abind(merged_imp$chainVar, 
                         imp[[n]]$chainVar)
          for (nn in names(merged_imp$imp)){
            # Non-imputed variables are not in the
            # data.frame format but are null
            if (!is.null(imp[[n]]$imp[[nn]])){
              colnames(imp[[n]]$imp[[nn]]) <- 
                (counter + 1):merged_imp$m
              merged_imp$imp[[nn]] <- 
                cbind(merged_imp$imp[[nn]],
                      imp[[n]]$imp[[nn]])
            }
          }
        }
      }
      # TODO: The function should update the $call parameter
      return(merged_imp)
    }
    

    We can now simply merge the two above generated imputations through:

    merged_imp <- mergeMice(imp)
    merged_imp_pars <- mergeMice(imp_pars)
    

    Now it seems that we get the right output:

    # Compare the three alternatives
    cbind(
      summary(pool(with(data=merged_imp,
                        exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
     summary(pool(with(data=merged_imp_pars,
                        exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
     summary(pool(with(data=mice(nhanes, 
                                 m = merged_imp$m, 
                                 printFlag = FALSE),
                       exp=lm(bmi~age+hyp+chl))))[,c("est", "se")])
    

    Gives:

                        est         se         est        se
    (Intercept) 20.16057550 3.74819873 20.31814393 3.7346252
    age         -3.67906629 1.19873118 -3.64395716 1.1476377
    hyp          1.72637216 2.01171565  1.71063127 1.9936347
    chl          0.05590999 0.02350609  0.05476829 0.0213819
                        est         se
    (Intercept) 20.14271905 3.60702992
    age         -3.78345532 1.21550474
    hyp          1.77361005 2.11415290
    chl          0.05648672 0.02046868
    

    Ok, that's it. Have fun.

    0 讨论(0)
提交回复
热议问题