Sample a single row, per column, with substantial missing data

Question

As an example of my data frame, which I will call df1, I have GROUP1 with three rows of data, and GROUP2 with two rows of data. I have three variables, X1, X2, and X3:

GROUP          X1    X2   X3
GROUP1         A     NA   NA
GROUP1         NA    NA   T
GROUP1         C     T    G   
GROUP2         NA    NA   C
GROUP2         G     NA   T

I am halfway to an answer based on a previous question and answer (Sample a single row, per column, within a subset of a data frame in R, while following conditions), but I am running into issues with character data.

I would like to sample a single value per column from GROUP1 to make a new row representing GROUP1. I do not want to sample one complete row from GROUP1; the sampling needs to occur independently for each column. I would like to do the same for GROUP2. The sampling should also ignore NA's, unless all rows for that group's variable are NA (such as GROUP2, variable X2, above).

For example, after sampling, I could have as a result:

GROUP         X1    X2   X3
GROUP1        A     T    T
GROUP2        G     NA   C

Only GROUP2, variable X2, can result in NA here. I actually have 300 taxa, 40 groups, 160000 variables, and a substantial number of NA's.

When I use:

library(data.table)

setDT(df1)[,lapply(.SD, function(x)
if(all(is.na(x))) NA_character_ else sample(na.omit(x),1)) , by = GROUP]

I end up with a warning:

Column 2 of result for group 2 is type 'character' but expecting type
'integer'. Column types must be consistent for each group.

However, this warning does not seem to be limited to group variables composed entirely of NA's.

If I instead replace NA_character_ with NA_integer_, some columns contain the sum of the non-NA rows for the group's variable, rather than a sample from across the rows.

Answer 1:

You can use this data.table call:

setDT(df1)[ , lapply(.SD, 
  function(x) x[!is.na(x)][sample(sum(!is.na(x)), 1)]), by = GROUP]

Or you can tweak your original one

setDT(df1)[,lapply(.SD, function(x)
  if(all(is.na(x))) NA_character_ 
    else as.character(na.omit(x))[sample(length(na.omit(x)), 1)]) , by = GROUP]

Or using aggregate from base R:

aggregate(df1[ , names(df1) != "GROUP"], by=list(df1$GROUP), 
  function(ii) ifelse(length(na.omit(ii)) == 0, 
    NA,
    as.character(na.omit(ii))[sample(length(na.omit(ii)), 1)])) 
    # Note use of as.character in case of factors
#  Group.1 X1   X2 X3
#1  GROUP1  A    T  T
#2  GROUP2  G <NA>  C

As thelatemail mentioned, the issue you are encountering is most likely due to the variables being factors, since your code works when X1-X3 are characters. Any of the above solutions should work with factors.
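To make the factor point concrete, here is a minimal base R sketch. The helper name sample_non_na is hypothetical and not part of the answers above; it converts its input to character first, so factor columns yield labels rather than integer codes, and it returns NA_character_ when a group has no non-missing values.

```r
# Hypothetical helper (not from the original answers): pick one non-NA value.
sample_non_na <- function(x) {
  x <- as.character(x)                 # factors become their labels
  keep <- x[!is.na(x)]
  if (length(keep) == 0L) return(NA_character_)
  keep[sample.int(length(keep), 1L)]   # index-based, one value per call
}

# Same toy data as in the question, deliberately stored as factors.
df1 <- data.frame(
  GROUP = c("GROUP1", "GROUP1", "GROUP1", "GROUP2", "GROUP2"),
  X1 = c("A", NA, "C", NA, "G"),
  X2 = c(NA, NA, "T", NA, NA),
  X3 = c(NA, "T", "G", "C", "T"),
  stringsAsFactors = TRUE
)

value_cols <- setdiff(names(df1), "GROUP")

# Split by group, sample each column independently, then bind the rows back.
rows <- lapply(split(df1, df1$GROUP), function(g) {
  data.frame(GROUP = as.character(g$GROUP[1]),
             lapply(g[value_cols], sample_non_na),
             stringsAsFactors = FALSE)
})
res <- do.call(rbind, rows)
print(res)
```

The same helper should also drop into the data.table call, for example setDT(df1)[, lapply(.SD, sample_non_na), by = GROUP], since it always returns a length-one character value.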

Answer 2:

Using dplyr, you can do something like this:

library(dplyr)

sampleValue <- function(x) {
  ifelse(sum(is.na(x)) == length(x), x[NA], sample(x[!is.na(x)], 1))
}

df <- data.frame(GROUP = c('GROUP1', 'GROUP1', 'GROUP1', 'GROUP2', 'GROUP2'),
                 X1 = c('A', NA, 'C', NA, 'G'),
                 X2  = c(NA, NA, 'T', NA, NA),
                 X3 = c(NA, 'T', 'G', 'C', 'T'),
                 stringsAsFactors = FALSE)
df %>% group_by(GROUP) %>% summarise_each(funs(sampleValue), -GROUP)

The function returns one randomly sampled value from the non-NA values of the supplied vector, or NA if every value is NA. The last line of the code applies this function to each column within each group.

The output is as follows (it changes between runs because the sampling is random):

Source: local data frame [2 x 4]

   GROUP    X1    X2    X3
   (chr) (chr) (chr) (chr)
1 GROUP1     A     T     T
2 GROUP2     G    NA     C
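
In later dplyr releases, summarise_each() and funs() are deprecated. A rough equivalent, assuming dplyr 1.0.0 or later (where across() is available), might look like the sketch below. The name sample_one is hypothetical; it is a scalar rewrite of sampleValue that uses if/else instead of ifelse().

```r
library(dplyr)

# Hypothetical scalar rewrite of sampleValue (name is an assumption).
sample_one <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) == 0L) return(x[NA_integer_])  # length-one NA of x's own type
  keep[sample.int(length(keep), 1L)]
}

df <- data.frame(GROUP = c('GROUP1', 'GROUP1', 'GROUP1', 'GROUP2', 'GROUP2'),
                 X1 = c('A', NA, 'C', NA, 'G'),
                 X2 = c(NA, NA, 'T', NA, NA),
                 X3 = c(NA, 'T', 'G', 'C', 'T'),
                 stringsAsFactors = FALSE)

# everything() inside across() skips the grouping column automatically.
res <- df %>%
  group_by(GROUP) %>%
  summarise(across(everything(), sample_one))
print(res)
```

Indexing with sample.int() rather than calling sample() on the values avoids the case where sample() treats a single number as a range to draw from, which matters if the columns are ever numeric.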

Source: https://stackoverflow.com/questions/34711685/sample-a-single-row-per-column-with-substantial-missing-data

Tags

if-statement

dataframe

data.table

missing-data