Question
I have a data frame that is half sparse: half of all cells are missing (NA), so when I run mice it tries to impute all of them. I'm only interested in a subset of the columns.
Question: In the following code, how do I make mice operate only on the first two columns? Is there a clean way to do this using a row lag or row lead, so that the content of the previous row can help patch holes in the current row?
library(mice)
set.seed(1)
#domain
x <- seq(from=0,to=10,length.out=1000)
#ranges
y <- sin(x) +sin(x/2) + rnorm(n = length(x))
y2 <- sin(x) +sin(x/2) + rnorm(n = length(x))
#kill 50% of cells
idx_na1 <- sample(x=1:length(x),size = length(x)/2)
y[idx_na1] <- NA
#kill more cells
idx_na2 <- sample(x=1:length(x),size = length(x)/2)
y2[idx_na2] <- NA
#assemble base data
my_data <- data.frame(x,y,y2)
#make the rest of the data
# start at column 4 so that y2 (column 3) is not overwritten
for (i in 4:50) {
  my_data[, i] <- rnorm(n = length(x))
  idx_na2 <- sample(x = 1:length(x), size = length(x) / 2)
  my_data[idx_na2, i] <- NA
}
#imputation
est <- mice(my_data)
data2 <- complete(est)
str(data2[,1:3])
Places that I have looked for answers:
- help document (link)
- google of course...
- https://stats.stackexchange.com/questions/99334/fast-missing-data-imputation-in-r-for-big-data-that-is-more-sophisticated-than-s
Answer 1:
I think what you are looking for can be done with the "where" parameter of the mice function. "where" takes a logical matrix (or data frame) with the same dimensions as the dataset you are imputing. By default it equals is.na(data): a matrix whose cells are TRUE where the value is missing in your dataset and FALSE otherwise. This means that, by default, every missing value in your dataset is imputed. If you want to impute only the values in a specific column (column 2 in my example), you can do this:
# Logical matrix: TRUE where data is missing, FALSE otherwise
A <- is.na(data)
# Set every column except the one you want to impute (say, column 2) to FALSE
A[,-2] <- FALSE
# Run mice, imputing only the cells flagged TRUE in A
imputed_data <- mice(data, where = A)
Answer 2:
A faster alternative to the "where" parameter might be the "method" parameter. You can set it to "" for the columns/variables you want to skip. The downside is that automatic determination of the method will not work. So:
imp <- mice(data,
method = ifelse(colnames(data) == "your_var", "logreg", ""))
But you can get the default method from the documentation:
defaultMethod
... By default, the method uses
- pmm, predictive mean matching (numeric data)
- logreg, logistic regression imputation (binary data, factor with 2 levels)
- polyreg, polytomous regression imputation for unordered categorical data (factor > 2 levels)
- polr, proportional odds model for (ordered, > 2 levels).
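The downside mentioned above (losing the automatic method choice) can probably be avoided by letting mice build the default method vector first and then blanking the entries to skip. The following is a minimal sketch, not a tested recipe for the asker's data: it assumes mice 3.0 or later, where make.method() and make.predictorMatrix() are available, and it uses a small hypothetical data frame (toy, with an extra incomplete column z) instead of my_data. It also removes z from the predictor set, on the assumption that a predictor that still contains NAs may leave some target cells unimputed.

```r
library(mice)

set.seed(1)
n <- 200
x <- seq(0, 10, length.out = n)

# Hypothetical toy data: x is complete; y, y2 and z each lose half their cells
toy <- data.frame(
  x  = x,
  y  = sin(x) + rnorm(n),
  y2 = sin(x) + rnorm(n),
  z  = rnorm(n)
)
toy$y[sample(n, n / 2)]  <- NA
toy$y2[sample(n, n / 2)] <- NA
toy$z[sample(n, n / 2)]  <- NA

targets <- c("y", "y2")  # the only columns we want imputed

# Start from the automatically chosen methods, then skip everything else
meth <- make.method(toy)
meth[!(names(meth) %in% targets)] <- ""

# Assumption: keep the unimputed, incomplete column z out of the predictors
pred <- make.predictorMatrix(toy)
pred[, "z"] <- 0

imp <- mice(toy, method = meth, predictorMatrix = pred,
            printFlag = FALSE, seed = 1)
filled <- complete(imp)  # first completed data set

# Missing cells per column after imputation
colSums(is.na(filled))
```

With my_data, the same idea would mean zeroing the predictor-matrix columns of all 47 noise variables, so that y and y2 are predicted only from x and from each other.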
Answer 3:
Your question isn't entirely clear to me. Are you saying you wish to operate on only two columns? In that case mice(my_data[,1:2]) will work. Or do you want to use all the data but only fill in missing values for some columns? To do this, I'd just create an indicator matrix along the following lines:
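# Indicator of the originally missing cells
isNA <- data.frame(apply(my_data, 2, is.na))
est <- mice(my_data)
completed <- complete(est)  # first completed data set
# Columns whose imputed values should be discarded (example choice: all but y and y2)
skip <- setdiff(names(completed), c("y", "y2"))
completed[skip] <- mapply(function(x, isna) {
  x[isna] <- NA  # restore NA where the original value was missing
  return(x)
}, completed[skip], isNA[skip], SIMPLIFY = FALSE)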
For your final question, "can I use mice for rolling data imputation?", I believe the answer is no. But you should double-check the documentation.
Source: https://stackoverflow.com/questions/42161230/r-mice-missing-variable-imputation-how-to-only-do-one-column-in-sparse-m