Question
I would like to add random NA to a data.frame in R. So far I've looked into these questions:
R: Randomly insert NAs into dataframe proportionaly
How do I add random NAs into a data frame
add random missing values to a complete data frame (in R)
Many solutions were provided there, but I couldn't find one that complies with these 5 conditions:
- Add truly random NA, and not the same number per row or per column
- Work with every class of variable that one can encounter in a data.frame (numeric, character, factor, logical, ts..), so the output must have the same format as the input data.frame or matrix.
- Guarantee an exact number or proportion [note] of NA in the output (many solutions result in a smaller number of NA since several are generated at the same place)
- Be computationally efficient for big datasets.
- Add the proportion/number of NA independently of already present NA in the input.
Does anyone have an idea? I have already tried to write a function to do this (in an answer to the first link), but it doesn't comply with conditions 3 and 4. Thanks.
[note] The exact proportion, rounded to within +/- 1 NA of course.
Answer 1:
This is the way that I do it for my paper on library(imputeMulti), which is currently in review at JSS. This inserts NAs into a random percentage of the whole dataset and scales well. It doesn't guarantee an exact number in the case where (n * p * pctNA) %% 1 != 0, because the count is rounded down with floor().
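createNAs <- function(x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  # One logical flag per cell; TRUE marks a cell that will become NA
  NAloc <- rep(FALSE, n * p)
  # Draw floor(n * p * pctNA) distinct cell positions without replacement
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  # Reshape the flags to the dimensions of x and blank out the flagged cells
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}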
Obviously, you should set a random seed (with set.seed()) before the function call for reproducibility.
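A minimal usage sketch, assuming the createNAs function above (repeated here so the snippet runs on its own). The data frame df, its column names, and the seed value are made up for illustration.

```r
createNAs <- function(x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  NAloc <- rep(FALSE, n * p)
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  x
}

# Hypothetical complete data frame with mixed column classes
df <- data.frame(
  num = 1:10 / 2,
  chr = letters[1:10],
  fac = factor(rep(c("a", "b"), 5)),
  lgl = rep(c(TRUE, FALSE), 5),
  stringsAsFactors = FALSE
)

set.seed(42)                        # fix the seed before the call
out1 <- createNAs(df, pctNA = 0.25)
set.seed(42)                        # same seed again
out2 <- createNAs(df, pctNA = 0.25)

identical(out1, out2)               # same seed, same NA positions
sum(is.na(out1))                    # floor(10 * 4 * 0.25) = 10
sapply(out1, class)                 # column classes are unchanged
```

Resetting the seed before each call is what makes the two outputs match; a single set.seed() at the top of a script is usually enough when the calls always run in the same order.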
This works as a general strategy for creating baseline datasets for comparison across imputation methods. I believe this is what you want, although your question (as noted in the comments) isn't clearly stated.
Edit: I do assume that x is complete, so I'm not sure how it would handle existing missing data. You could certainly modify the code if you want, though that would probably increase the runtime by at least O(n*p).
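One possible modification along those lines, as a hedged sketch and not part of imputeMulti: draw the new positions only from cells that are not already NA, so every draw adds a new missing value. The name createNAs_skipExisting is hypothetical, and pctNA is assumed here to be a fraction of all cells, existing NA included. The extra cost is the is.na(x) scan, which is O(n*p).

```r
# Hypothetical variant: only cells that are not already NA can be selected
createNAs_skipExisting <- function(x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  isNA <- as.matrix(is.na(x))            # n x p logical matrix, one O(n*p) scan
  free <- which(!isNA)                   # linear indices of non-missing cells
  # Never ask for more new NA than there are non-missing cells left
  nAdd <- min(floor(n * p * pctNA), length(free))
  NAloc <- matrix(FALSE, nrow = n, ncol = p)
  NAloc[free[sample.int(length(free), nAdd)]] <- TRUE
  x[NAloc] <- NA
  x
}

# Made-up example: 20 cells, 3 of them already missing
set.seed(1)
df <- data.frame(a = c(NA, 2:10), b = c(letters[1:8], NA, NA),
                 stringsAsFactors = FALSE)
out <- createNAs_skipExisting(df, pctNA = 0.25)
sum(is.na(df))     # 3 existing NA
sum(is.na(out))    # 3 + floor(20 * 0.25) = 8
```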
Answer 2:
Some users reported that Alex's answer did not address condition 5 of my question. Indeed, when adding random NA to a data frame that already contains missing values, some of the new ones will fall on the existing ones, and the final proportion will be somewhere between the requested proportion and the sum of the initial and requested proportions. So I expanded on Alex's function to comply with all 5 conditions:
I modified his createNAs function so that it offers one of 3 options:
- option complement: complement with NA up to the desired %
- option add : add % of NA in addition to those already present
- option none : add a % of NA regardless of those already present
For options 1 and 2 (complement and add), the function calls itself repeatedly with option none until the desired proportion of NA is reached:
createNAs <- function(x, pctNA = 0.0, option = "add") {
  # Proportion of cells in x that are currently NA
  prop.NA <- function(x) sum(is.na(x)) / prod(dim(x))
  initial.pctNA <- prop.NA(x)

  if (option == "complement" && initial.pctNA > pctNA) {
    message("The data already had more NA than the target percentage. Returning original data")
    return(x)
  }

  if (option == "none" || initial.pctNA == 0) {
    # Same as the original createNAs: positions are drawn among all cells
    n <- nrow(x)
    p <- ncol(x)
    NAloc <- rep(FALSE, n * p)
    NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
    x[matrix(NAloc, nrow = n, ncol = p)] <- NA
    return(x)
  } else {  # option is "complement" or "add"
    target <- if (option == "complement") pctNA else pctNA + initial.pctNA
    # Stop once less than one whole cell remains to be filled; otherwise
    # floor() would add nothing and the loop would never terminate
    while ((target - prop.NA(x)) * prod(dim(x)) >= 1) {
      x <- createNAs(x, target - prop.NA(x), option = "none")
    }
    return(x)
  }
}
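To make the three options concrete, here is a small illustrative calculation with made-up numbers (100 cells, 10 already missing, pctNA = 0.25). It does not call createNAs; it only shows the NA count each option aims for, and simulates why option none lands somewhere in between.

```r
nCells   <- 100    # hypothetical table size (n * p)
existing <- 10     # cells that are already NA
pctNA    <- 0.25

# Number of NA cells each option aims for after the call
targetComplement <- floor(nCells * pctNA)              # 25 in total
targetAdd        <- floor(nCells * pctNA) + existing   # 35 in total

# Option "none": 25 positions are drawn among all 100 cells, so some of
# them can coincide with the 10 cells that are already NA
set.seed(123)
isNA <- c(rep(TRUE, existing), rep(FALSE, nCells - existing))
isNA[sample.int(nCells, floor(nCells * pctNA))] <- TRUE
totalNone <- sum(isNA)   # somewhere between 25 and 35
```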
Source: https://stackoverflow.com/questions/39513837/add-exact-proportion-of-random-missing-values-to-data-frame