问题
I want to run 150 multiple imputations by using mice
in R
. However, in order to save some computing time, I would lie to subdivide the process in parallel streams (as suggested by Stef van Buuren in "Flexible Imputation for Missing Data").
My question is: how to do that?
I can imagine 2 options:
opt.1:
imp1<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp2<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp...<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp150<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
and then combine the imputations together by using complete
and as.mids
afterwards
opt.2:
imp1<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp2<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp...<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp150<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
by adding VAL_1to150
otherwise it seems to me (I may be wrong) that if they all run with the same dataset and the same seed you will have 150 times the same result.
Are there any other options?
Thanks
回答1:
So the main problem is combining the imputations and as I see it there are three options, using ibind
, complete
as described or trying to keep the mids structure. I strongly suggest the ibind
solution. The others are left in the answer for those curious.
Get parallel results
Before doing anything we need to get the parallel mice imputations. The parallel part is rather simple, all we need to do is to use the parallel package and make sure that we set the seed using the clusterSetRNGStream
:
library(parallel)
# Using all cores can slow down the computer
# significantly, I therefore try to leave one
# core alone in order to be able to do something
# else during the time the code runs
cores_2_use <- detectCores() - 1
cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
clusterExport(cl, "nhanes")
clusterEvalQ(cl, library(mice))
imp_pars <-
parLapply(cl = cl, X = 1:cores_2_use, fun = function(no){
mice(nhanes, m = 30, printFlag = FALSE)
})
stopCluster(cl)
The above will yield cores_2_use * 30
imputed datasets.
Using ibind
As @AleksanderBlekh suggested, the mice::ibind
is probably the best, most straightforward solution:
imp_merged <- imp_pars[[1]]
for (n in 2:length(imp_pars)){
imp_merged <-
ibind(imp_merged,
imp_pars[[n]])
}
Using foreach
with ibind
The perhaps the simplest alternative is to use foreach
:
library(foreach)
library(doParallel)
cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
registerDoParallel(cl)
library(mice)
imp_merged <-
foreach(no = 1:cores_2_use,
.combine = ibind,
.export = "nhanes",
.packages = "mice") %dopar%
{
mice(nhanes, m = 30, printFlag = FALSE)
}
stopCluster(cl)
Using complete
Extracting the full datasets using complete(..., action="long")
, rbind
-ing these and then using as.mids
other mice
objects may work well but it generates a slimmer object than what the other two approaches:
merged_df <- nhanes
merged_df <-
cbind(data.frame(.imp = 0,
.id = 1:nrow(nhanes)),
merged_df)
for (n in 1:length(imp_pars)){
tmp <- complete(imp_pars[[n]], action = "long")
tmp$.imp <- as.numeric(tmp$.imp) + max(merged_df$.imp)
merged_df <-
rbind(merged_df,
tmp)
}
imp_merged <-
as.mids(merged_df)
# Compare the most important the est and se for easier comparison
cbind(summary(pool(with(data=imp_merged,
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
summary(pool(with(data=mice(nhanes,
m = 60,
printFlag = FALSE),
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")])
Gives the output:
est se est se
(Intercept) 20.41921496 3.85943925 20.33952967 3.79002725
age -3.56928102 1.35801557 -3.65568620 1.27603817
hyp 1.63952970 2.05618895 1.60216683 2.17650536
chl 0.05396451 0.02278867 0.05525561 0.02087995
Keeping a correct mids-object
My alternative approach below shows how to merge imputation objects and retain the full functionality behind the mids
object. Since the ibind
solution I've left this in for anyone interested in exploring how to merge complex lists.
I've looked into mice
's mids-object and there are a few step that you have to take in order to get at least a similar mids-object after running in parallel. If we examine the mids-object and compare two objects with two different setups we get:
library(mice)
imp <- list()
imp <- c(imp,
list(mice(nhanes, m = 40)))
imp <- c(imp,
list(mice(nhanes, m = 20)))
sapply(names(imp[[1]]),
function(n)
try(all(useful::compare.list(imp[[1]][[n]],
imp[[2]][[n]]))))
Where you can see that the call, m, imp, chainMean, and chainVar differ between the two runs. Out of these the imp is without doubt the most important but it seems like a wise option to update the other components as well. We will therefore start by building a mice merger function:
mergeMice <- function (imp) {
merged_imp <- NULL
for (n in 1:length(imp)){
if (is.null(merged_imp)){
merged_imp <- imp[[n]]
}else{
counter <- merged_imp$m
# Update counter
merged_imp$m <-
merged_imp$m + imp[[n]]$m
# Rename chains
dimnames(imp[[n]]$chainMean)[[3]] <-
sprintf("Chain %d", (counter + 1):merged_imp$m)
dimnames(imp[[n]]$chainVar)[[3]] <-
sprintf("Chain %d", (counter + 1):merged_imp$m)
# Merge chains
merged_imp$chainMean <-
abind::abind(merged_imp$chainMean,
imp[[n]]$chainMean)
merged_imp$chainVar <-
abind::abind(merged_imp$chainVar,
imp[[n]]$chainVar)
for (nn in names(merged_imp$imp)){
# Non-imputed variables are not in the
# data.frame format but are null
if (!is.null(imp[[n]]$imp[[nn]])){
colnames(imp[[n]]$imp[[nn]]) <-
(counter + 1):merged_imp$m
merged_imp$imp[[nn]] <-
cbind(merged_imp$imp[[nn]],
imp[[n]]$imp[[nn]])
}
}
}
}
# TODO: The function should update the $call parameter
return(merged_imp)
}
We can now simply merge the two above generated imputations through:
merged_imp <- mergeMice(imp)
merged_imp_pars <- mergeMice(imp_pars)
Now it seems that we get the right output:
# Compare the three alternatives
cbind(
summary(pool(with(data=merged_imp,
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
summary(pool(with(data=merged_imp_pars,
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
summary(pool(with(data=mice(nhanes,
m = merged_imp$m,
printFlag = FALSE),
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")])
Gives:
est se est se
(Intercept) 20.16057550 3.74819873 20.31814393 3.7346252
age -3.67906629 1.19873118 -3.64395716 1.1476377
hyp 1.72637216 2.01171565 1.71063127 1.9936347
chl 0.05590999 0.02350609 0.05476829 0.0213819
est se
(Intercept) 20.14271905 3.60702992
age -3.78345532 1.21550474
hyp 1.77361005 2.11415290
chl 0.05648672 0.02046868
Ok, that's it. Have fun.
来源:https://stackoverflow.com/questions/24040280/parallel-computation-of-multiple-imputation-by-using-mice-r-package