Question
As an example of my data frame, which I will call df1, I have GROUP1 with three rows of data and GROUP2 with two rows of data. I have three variables, X1, X2, and X3:
GROUP X1 X2 X3
GROUP1 A NA NA
GROUP1 NA NA T
GROUP1 C T G
GROUP2 NA NA C
GROUP2 G NA T
I am halfway to my answer, based on a previous question and answer (Sample a single row, per column, within a subset of a data frame in R, while following conditions) except I am having issues using characters.
I would like to sample a single value per column from GROUP1 to make a new row representing GROUP1. I do not want to sample one complete row from GROUP1; rather, the sampling needs to occur individually for each column. I would like to do the same for GROUP2. Also, the sampling should not consider NAs, unless all rows for that group's variable are NA (such as GROUP2, variable X2, above).
For example, after sampling, I could have as a result:
GROUP X1 X2 X3
GROUP1 A T T
GROUP2 G NA C
Only GROUP2, variable X2, can result in NA here. I actually have 300 taxa, 40 groups, 160000 variables, and a substantial number of NA's.
When I use:
library(data.table)
setDT(df1)[,lapply(.SD, function(x)
if(all(is.na(x))) NA_character_ else sample(na.omit(x),1)) , by = GROUP]
I end up with a warning:
Column 2 of result for group 2 is type 'character' but expecting type
'integer'. Column types must be consistent for each group.
However, this warning does not seem to be limited to the group variables composed entirely of NAs.
If I instead replace NA_character_ with NA_integer_, some columns result in the sum of non-NA rows for the group's variable, rather than a sample from across the rows.
Answer 1:
You can use this data.table call:
setDT(df1)[ , lapply(.SD,
function(x) x[!is.na(x)][sample(sum(!is.na(x)), 1)]), by = GROUP]
Or you can tweak your original one:
setDT(df1)[,lapply(.SD, function(x)
if(all(is.na(x))) NA_character_
else as.character(na.omit(x))[sample(length(na.omit(x)), 1)]) , by = GROUP]
Or using aggregate from base R:
aggregate(df1[ , names(df1) != "GROUP"], by=list(df1$GROUP),
function(ii) ifelse(length(na.omit(ii)) == 0,
NA,
as.character(na.omit(ii))[sample(length(na.omit(ii)), 1)]))
# Note use of as.character in case of factors
# Group.1 X1 X2 X3
#1 GROUP1 A T T
#2 GROUP2 G <NA> C
As thelatemail mentioned, the issue you are encountering is most likely due to the variables being factors, as your code works when X1-X3 are characters. Any of the above solutions should work with factors.
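To make the factor explanation concrete, here is a minimal base R sketch. The vector x is made-up illustration data, not taken from the question. It shows that a factor is stored as integer codes, which is consistent with the warning about 'character' versus 'integer', and that converting with as.character before sampling yields a plain string.

```r
# Hypothetical illustration data: one column with a missing value, stored as a factor.
x <- factor(c("A", NA, "C"))

# A factor is integer codes plus a levels attribute, so its storage type is not character.
storage_type <- typeof(x)            # "integer"
codes <- as.integer(x)               # 1 NA 2

# Converting to character first gives plain strings to sample from.
labels <- as.character(na.omit(x))   # "A" "C"
picked <- labels[sample(length(labels), 1)]
```

Because picked is always a character value here, it has the same type as NA_character_, which is presumably why the as.character variants above avoid the type mismatch between groups.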
Answer 2:
Using dplyr, you can do something like this:
library(dplyr)
sampleValue <- function(x) {
ifelse(sum(is.na(x)) == length(x), x[NA], sample(x[!is.na(x)], 1))
}
df <- data.frame(GROUP = c('GROUP1', 'GROUP1', 'GROUP1', 'GROUP2', 'GROUP2'),
X1 = c('A', NA, 'C', NA, 'G'),
X2 = c(NA, NA, 'T', NA, NA),
X3 = c(NA, 'T', 'G', 'C', 'T'),
stringsAsFactors = FALSE)
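df %>% group_by(GROUP) %>% summarise_each(funs(sampleValue), -GROUP)
# summarise_each() and funs() are deprecated in newer dplyr; on dplyr >= 1.0 the equivalent is:
# df %>% group_by(GROUP) %>% summarise(across(everything(), sampleValue))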
The function returns a value sampled from the non-NA entries of the vector it is given, or NA if all entries are NA. The last line applies it to every column within each group.
The output is as follows (note that it changes between runs because random sampling is involved):
Source: local data frame [2 x 4]
GROUP X1 X2 X3
(chr) (chr) (chr) (chr)
1 GROUP1 A T T
2 GROUP2 G NA C
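For readers without dplyr or data.table, the same per-group, per-column sampling can be sketched in base R. This is only an illustration under stated assumptions: the helper name sample_non_na and the variables value_cols, rows, and result are hypothetical, and the data frame is the same toy example as above. It samples an index into the non-NA values, as in Answer 1, rather than passing the values themselves to sample().

```r
# Hypothetical helper: sample one non-NA value, or return NA if none exist.
sample_non_na <- function(x) {
  vals <- as.character(x[!is.na(x)])   # as.character guards against factor columns
  if (length(vals) == 0) return(NA_character_)
  vals[sample.int(length(vals), 1)]    # sample an index, not the values themselves
}

df <- data.frame(GROUP = c("GROUP1", "GROUP1", "GROUP1", "GROUP2", "GROUP2"),
                 X1 = c("A", NA, "C", NA, "G"),
                 X2 = c(NA, NA, "T", NA, NA),
                 X3 = c(NA, "T", "G", "C", "T"),
                 stringsAsFactors = FALSE)

value_cols <- setdiff(names(df), "GROUP")

# Split the rows by group, sample each column independently, keep one row per group.
rows <- lapply(split(df[value_cols], df$GROUP), function(g) {
  as.data.frame(lapply(g, sample_non_na), stringsAsFactors = FALSE)
})
result <- data.frame(GROUP = names(rows), do.call(rbind, rows),
                     row.names = NULL, stringsAsFactors = FALSE)
print(result)
```

With this toy data, only GROUP2's X2 can come out as NA; every other cell is drawn from that group's non-NA values for the column. This sketch has not been tuned for the 160000-column case described in the question.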
Source: https://stackoverflow.com/questions/34711685/sample-a-single-row-per-column-with-substantial-missing-data