Sample a single row, per column, with substantial missing data

Posted by 我们两清 on 2019-12-10 21:07:39

Question


As an example of my data frame, which I will call df1, I have GROUP1 with three rows of data, and GROUP2 with two rows of data. I have three variables, X1, X2, and X3:

GROUP          X1    X2   X3
GROUP1         A     NA   NA
GROUP1         NA    NA   T
GROUP1         C     T    G   
GROUP2         NA    NA   C
GROUP2         G     NA   T

I am halfway to my answer, based on a previous question and answer (Sample a single row, per column, within a subset of a data frame in R, while following conditions) except I am having issues using characters.

I would like to sample a single value per column from GROUP1 to make a new row representing GROUP1. I do not want to sample one complete row from GROUP1; the sampling needs to occur independently for each column. I would like to do the same for GROUP2. The sampling should also ignore NAs, unless every row for that group's variable is NA (such as GROUP2, variable X2, above).

For example, after sampling, I could have as a result:

GROUP         X1    X2   X3
GROUP1        A     T    T
GROUP2        G     NA   C

Only GROUP2, variable X2, can result in NA here. I actually have 300 taxa, 40 groups, 160000 variables, and a substantial number of NA's.

When I use:

library(data.table)

setDT(df1)[,lapply(.SD, function(x)
if(all(is.na(x))) NA_character_ else sample(na.omit(x),1)) , by = GROUP]

I end up with a warning:

Column 2 of result for group 2 is type 'character' but expecting type    
'integer'. Column types must be consistent for each group.

However, this warning does not seem to be limited to variables that are entirely NA within a group.

If I instead replace NA_character_ with NA_integer_, some columns return the sum of non-NA rows for the group's variable, rather than a sample from across the rows.


Answer 1:


You can use this data.table call:

setDT(df1)[ , lapply(.SD, 
  function(x) x[!is.na(x)][sample(sum(!is.na(x)), 1)]), by = GROUP]

Or you can tweak your original one:

setDT(df1)[,lapply(.SD, function(x)
  if(all(is.na(x))) NA_character_ 
    else as.character(na.omit(x))[sample(length(na.omit(x)), 1)]) , by = GROUP]

Or using aggregate from base R:

aggregate(df1[ , names(df1) != "GROUP"], by=list(df1$GROUP), 
  function(ii) ifelse(length(na.omit(ii)) == 0, 
    NA,
    as.character(na.omit(ii))[sample(length(na.omit(ii)), 1)])) 
    # Note use of as.character in case of factors
#  Group.1 X1   X2 X3
#1  GROUP1  A    T  T
#2  GROUP2  G <NA>  C

As thelatemail mentioned, the issue you are encountering is most likely due to variables being factors, as your code works when X1-X3 are characters. Any of the above solutions should work with factors.
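
To make the factor point concrete, here is a minimal base-R sketch of the same per-column sampling. The helper name sample_non_na is hypothetical (it is not from the original answers), and the data frame is built with stringsAsFactors = TRUE on purpose to mimic the problem case. The helper converts factors to character first, returns an NA of the column's own type when the whole group is missing, and samples an index rather than the values themselves, since sample(x, 1) treats a single number x >= 1 as the range 1:x.

```r
# Hypothetical helper: draw one non-NA value from a vector,
# or return a typed NA when every value is missing.
sample_non_na <- function(x) {
  if (is.factor(x)) x <- as.character(x)  # avoid leaking integer factor codes
  vals <- x[!is.na(x)]
  if (length(vals) == 0) return(x[NA_integer_])  # length-1 NA of x's type
  # Sample an index, not the values: sample(5, 1) would draw from 1:5
  vals[sample.int(length(vals), 1)]
}

df1 <- data.frame(
  GROUP = c("GROUP1", "GROUP1", "GROUP1", "GROUP2", "GROUP2"),
  X1 = c("A", NA, "C", NA, "G"),
  X2 = c(NA, NA, "T", NA, NA),
  X3 = c(NA, "T", "G", "C", "T"),
  stringsAsFactors = TRUE  # factors on purpose, to mimic the problem case
)

# Apply the helper to every non-GROUP column within each group
value_cols <- setdiff(names(df1), "GROUP")
pieces <- lapply(split(df1[value_cols], df1$GROUP), function(g) {
  as.data.frame(lapply(g, sample_non_na), stringsAsFactors = FALSE)
})
result <- cbind(GROUP = names(pieces), do.call(rbind, pieces),
                stringsAsFactors = FALSE)
rownames(result) <- NULL
result
```

The sampled values differ between runs, but X2 for GROUP2 is always NA and every column comes back as character. For data with many thousands of columns, the data.table versions above are likely to be faster than this split/lapply approach.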




Answer 2:


Using dplyr, you can do something like this:

library(dplyr)

sampleValue <- function(x) {
  ifelse(sum(is.na(x)) == length(x), x[NA], sample(x[!is.na(x)], 1))
}

df <- data.frame(GROUP = c('GROUP1', 'GROUP1', 'GROUP1', 'GROUP2', 'GROUP2'),
                 X1 = c('A', NA, 'C', NA, 'G'),
                 X2  = c(NA, NA, 'T', NA, NA),
                 X3 = c(NA, 'T', 'G', 'C', 'T'),
                 stringsAsFactors = FALSE)
df %>% group_by(GROUP) %>% summarise_each(funs(sampleValue), -GROUP)

The function returns one sampled value from the non-NA values of the vector it is given, or NA if every value is NA. The final line of code applies it to every column within each group.

The output is as follows (note that it changes between runs, since random sampling is involved):

Source: local data frame [2 x 4]

   GROUP    X1    X2    X3
   (chr) (chr) (chr) (chr)
1 GROUP1     A     T     T
2 GROUP2     G    NA     C
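
summarise_each() and funs() have since been deprecated in dplyr. Assuming dplyr 1.0.0 or later, the same grouped summary can be sketched with across(). The helper name sample_one is hypothetical; it uses if/else and index sampling in place of ifelse(), which also sidesteps the case where sample() is handed a single number and draws from 1:x.

```r
library(dplyr)

# Hypothetical helper: one non-NA value, or a typed NA if all are missing
sample_one <- function(x) {
  vals <- x[!is.na(x)]
  if (length(vals) == 0) x[NA_integer_] else vals[sample.int(length(vals), 1)]
}

df <- data.frame(GROUP = c('GROUP1', 'GROUP1', 'GROUP1', 'GROUP2', 'GROUP2'),
                 X1 = c('A', NA, 'C', NA, 'G'),
                 X2 = c(NA, NA, 'T', NA, NA),
                 X3 = c(NA, 'T', 'G', 'C', 'T'),
                 stringsAsFactors = FALSE)

# across(everything(), ...) skips the grouping column automatically
res <- df %>%
  group_by(GROUP) %>%
  summarise(across(everything(), sample_one))
res
```

As before, the sampled values vary between runs; only X2 for GROUP2 is guaranteed to be NA.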


Source: https://stackoverflow.com/questions/34711685/sample-a-single-row-per-column-with-substantial-missing-data
