How to replace NAs with row means if proportion of row-wise NAs is below a certain threshold?

问题

Apologies for the somewhat cumbersome question, but I am currently working on a mental health study. For one of the mental health screening tools there are 15 variables, each of which can have values of 0-3. The total score for each row/participant is then assigned by taking the sum of these 15 variables. The documentation for this tool states that if more than 20% of the values for a particular row/participant are missing, the total score should be taken as missing also, however if fewer than 20% of the values for a row are missing, each missing value should be assigned the mean of the remaining values for that row.

I decided that to do this I would have to calculate the proportion of NAs for each participant, calculate the mean of all 15 variables excluding NAs for each participant, and then use a conditional mutate statement (or something similar) that checked if the proportion of NAs was less than 20% and if so replaced NAs for the relevant columns with the mean value for that row, before finding the sum of all 15 variables for each row. The dataset also contains other columns besides these 15, so applying a function to all of the columns would not be useful.

To calculate the mean score without NAs I did the following:

mental$somatic_mean <- rowMeans(mental [, c("var1", "var2", "var3", 
"var4", "var5", "var6", "var7", "var8", "var9", "var10", "var11", 
"var12","var13", "var14", "var15")], na.rm=TRUE)

And to calculate the proportion of NAs for each variable:

mental$somatic_na <- rowMeans(is.na(mental [, c("var1", "var2", 
"var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10", "var11", 
"var12", "var13", "var14", "var15")]))

However when I attempted the mutate() statement to alter the rows where fewer than 20% of values were NA I can't identify any code that works. I have tried a lot of permutations by this point, including the following for each variable:

mental_recode <- mental %>%
  rowwise() %>%
  mutate(var1 = if(somatic_na<0.2) 
  replace_na(list(var1= somatic_mean)))

Which returns:

"no applicable method for 'replace_na' applied to an object of class "list""

and attempting to do them all together without using mutate():

mental %>%
  rowwise() %>%
  if(somatic_na<0.2)
                     replace_na(list(var1 = somatic_mean,   var2= 
somatic_mean,   var3 = somatic_mean,   var4 = somatic_mean,  var5 = 
somatic_mean,  var6 = somatic_mean,  var7 = somatic_mean, var8 = 
somatic_mean,  var9 = somatic_mean,  var10 = somatic_mean,   var11 = 
somatic_mean,  var12 = somatic_mean,   var13 = somatic_mean,  var14 = 
somatic_mean,  var15 = somatic_mean ))

Which returns:

Error in if (.) somatic_na < 0.2 else replace_na(mental, list(var1 = somatic_mean,  : 
  argument is not interpretable as logical
In addition: Warning message:
In if (.) somatic_na < 0.2 else replace_na(mental, list(var1 = somatic_mean,  :
  the condition has length > 1 and only the first element will be used

I also tried using if_else() in conjunction with mutate() and setting the value to NA if the condition was not met, but could not get that to work after various permutations and error messages either.

EDIT: Dummy data can be generated by the following:

mental <- structure(list(id = 1:21, var1 = c(0L, 0L, 1L, 1L, 1L, 0L, 0L, 
                               NA, 0L, 0L, 0L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 
0L, 0L, 0L), var2 = c(0L, 
 0L, 1L, 1L, 1L, 0L, 0L, 2L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 
2L, 0L, 1L, 1L), var3 = c(0L, 0L, 0L, 1L, 1L, 0L, 1L, 2L, 1L, 
1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 2L, 0L, 1L, 1L), var4 = c(1L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 
0L, 1L, 0L, 0L), var5 = c(0L, 0L, 0L, 1L, NA, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), var6 = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), var7 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, NA, 0L), var8 = c(0L, 
0L, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), var9 = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L), var10 = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 
0L, 0L, NA, 0L), var11 = c(1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, NA, 0L), var12 = c(1L, 
0L, 1L, 1L, NA, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 1L, 1L), var13 = c(1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 
0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, NA, 0L), var14 = c(1L, 
0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 
2L, 0L, 1L, 0L), var15 = c(1L, 0L, 2L, NA, NA, 0L, NA, 0L, 0L, 
0L, 0L, 0L, NA, NA, 0L, NA, NA, NA, NA, NA, 0L)), .Names = c("id", 
"var1", "var2", "var3", "var4", "var5", "var6", "var7", "var8", 
"var9", "var10", "var11", "var12", "var13", "var14", "var15"), class =                                 
"data.frame", row.names = c(NA, 
-21L))

Does anyone know of code that would work for this sort of situation?

Thanks in advance!

回答1:

Here is a way to do it all in one chain using dplyr using your supplied data frame.

First create a vector of all column names of interest:

name_col <- colnames(mental)[2:16]

And now use dplyr

library(dplyr)

mental %>% 
  # First create the column of row means
  mutate(somatic_mean = rowMeans(.[name_col], na.rm = TRUE)) %>% 
  # Now calculate the proportion of NAs
  mutate(somatic_na = rowMeans(is.na(.[name_col]))) %>% 
  # Create this column for filtering out later
  mutate(somatic_usable = ifelse(somatic_na < 0.2,
                                 "yes", "no")) %>% 
  # Make the following replacement on a row basis 
  rowwise() %>%
  mutate_at(vars(name_col), # Designate eligible columns to check for NAs
            funs(replace(., 
                         is.na(.) & somatic_na < 0.2, # Both conditions need to be met
                         somatic_mean))) %>% # What we are subbing the NAs with
  ungroup() # Now ungroup the 'rowwise' in case you need to modify further

Now, if you wanted to only select the entries that have less than 20% NAs, you can pipe the above into the following:

filter(somatic_usable == "yes")

Also of note, if you wanted to instead make the condition less than or equal to 20%, you would need to replace the two somatic_na < 0.2 with somatic_na <= 0.2.

Hope this helps!

回答2:

Here's a way using just base R expressions and remember mathematical properties of sums and means:

# generate fake data
set.seed(123)

dat <- data.frame(
  ID = 1:10,
  matrix(sample(c(0:3, NA), 10 * 15, TRUE), nrow = 10, ncol = 15),
  'another_var' = 'foo',
  'second_var' = 'bar',
  stringsAsFactors = FALSE
)

var_names <- paste0('X', 1:15)

# add number of NAs to data

dat$na_num <- rowSums(is.na(dat[var_names]))

# add row sum

dat$row_sum <- rowSums(dat[var_names], na.rm = TRUE) 

# add row mean

dat$row_mean <- rowMeans(dat[var_names], na.rm = TRUE)

# add final sum

dat$final_sum <- dat$row_sum + dat$row_mean * dat$na_num

# recode final sum to be NA if prop > .2

dat$final_sum <- ifelse(rowMeans(is.na(dat[var_names])) > .2,
                        NA,
                        dat$final_sum)

Here's a function that does the same thing. Where you specify your data and then a character vector of your variable names.

total_sum_calculation <- function(data, var_names){
  # add number of NAs to data
  na_num <- rowSums(is.na(data[var_names]))

  # add row sum
  row_sum <- rowSums(data[var_names], na.rm = TRUE) 

  # add row mean
  row_mean <- rowMeans(data[var_names], na.rm = TRUE)

  # add final sum
  final_sum <- row_sum + row_mean * na_num

  # recode final sum to be NA if prop > .2
  ifelse(rowMeans(is.na(data[var_names])) > .2,
                          NA,
                          final_sum)

}

v_names <- paste0('var', 1:15)
total_sum_calculation(data = mental, var_names = v_names)

 [1] 6.000000 0.000000 8.000000 7.500000       NA 0.000000 3.214286 9.230769 6.000000 2.000000 1.000000 0.000000 4.285714
[14]       NA 5.357143 5.357143 5.357143 9.642857 1.071429       NA 3.000000

来源：https://stackoverflow.com/questions/45616527/how-to-replace-nas-with-row-means-if-proportion-of-row-wise-nas-is-below-a-certa

标签

if-statement

dplyr

missing-data