Looping through data to set values > or < variable as NA in R

Question

I have a data frame containing columns with integers, characters, and numerics. The actual data set is much larger than the example given below, but what is below is a passable and much smaller imitation.

I am trying to loop through the data and, in the numeric columns only, change to NA any value greater than the mean + (3 * standard deviation) or less than the mean - (3 * standard deviation). If a column contains integers or characters, the loop should skip it and continue on to the next column. Additionally, most columns already contain some NA values and will have lots of values that fall within the mean +/- (3*sd). Those values need to remain as they are.

The ultimate goal of this script is to use it on future data sets with the same structure and while I am open to suggestions with packages, I would like to use loops if possible. However, I am far from an expert in R and will happily take any and all advice anyone has for me!

I have worked out a structure for the overall script, but it stops after the first next statement.

The script:

data = data.frame(test_data)

for (i in colnames(data)){
  if (class(data$i) == "numeric"){
    m = mean(data$i, na.rm=TRUE)
    sd = sd(data$i, na.rm=TRUE)
  }
    else
      next
  for (j in 1:nrow(data)){
    if (data$i[j,] > (m + 3*sd)){
      data$i[j,] <- NA
    }
    else if (data$i[j,] < (m - 3*sd)){
      data$i[j,] <- NA
    }
    else 
      next
    }
}

The data being used to test this script is as follows:

Trait1 = c(1.1, 1.2, 1.35, 1.1, 1.2, NA, 1000, 1.5, 1.4, 1.6)
Trait2 = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Trait3 = c(125.1, 119.3, 118.4, NA, 1.1, 122.3, 123.4, 125.7, 121.5, 121.7)
test_data = data.frame(Trait1, Trait2, Trait3)

Thank you in advance for any help you have to offer, I greatly appreciate it!

Answer 1:

Using dplyr and converting the numeric variables to a z-score using scale(), this can be simplified to:

library(dplyr)

test_data %>% 
  mutate_if(is.numeric, ~replace(.x, abs(scale(.x)) > 3, NA))
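
In dplyr 1.0.0 and later, mutate_if() is superseded by across(). A minimal sketch of the same idea in that style follows. It assumes a recent dplyr is installed, and it makes two changes that are not in the answer above: is.double replaces is.numeric so that integer columns are skipped, as the question asks, and which() drops the NA z-scores from the index.

```r
library(dplyr)

# Test data from the question
test_data <- data.frame(
  Trait1 = c(1.1, 1.2, 1.35, 1.1, 1.2, NA, 1000, 1.5, 1.4, 1.6),
  Trait2 = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"),
  Trait3 = c(125.1, 119.3, 118.4, NA, 1.1, 122.3, 123.4, 125.7, 121.5, 121.7)
)

cleaned <- test_data %>%
  mutate(across(
    where(is.double),
    # scale() returns a one-column matrix of z-scores; as.vector() flattens it
    ~ replace(.x, which(abs(as.vector(scale(.x))) > 3), NA)
  ))
```

Non-double columns such as Trait2 pass through unchanged.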

Answer 2:

Here is a solution without any loop (sorry :)) using the map_df function from the purrr package:

library(purrr)

Trait1 = c(1.1, 1.2, 1.35, 1.1, 1.2, NA, 1000, 1.5, 1.4, 1.6)
Trait2 = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Trait3 = c(125.1, 119.3, 118.4, NA, 1.1, 122.3, 123.4, 125.7, 121.5, 121.7)
test_data = data.frame(Trait1, Trait2, Trait3)

map_df(test_data,function(x) {
  if(class(x) == "numeric"){
    x[x <= (mean(x,na.rm = T) - 3*sd(x,na.rm = T)) | x>= (mean(x,na.rm = T) + 3*sd(x,na.rm = T))] = NA      
  }
  return(x)
}
)

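If you want the mean and sd calculation to include NA values, change na.rm = T to na.rm = F. Be aware that mean() and sd() then return NA for any column that contains an NA, so nothing in that column is replaced.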

NB: Note that in this data set no value lies outside the mean plus or minus three standard deviations. If you were thinking of 1000 in column Trait1 as your "suspicious" point, think again: it is not greater than mean + 3*sd. I recommend testing on a different data set.
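
The reason is arithmetic: with the sample standard deviation, no value among n observations can have a z-score larger than (n - 1)/sqrt(n), which is 8/3 (about 2.67) for the 9 non-missing values in each numeric column here. A quick base R check, where upper, z_max, z_bound and n_min are names introduced only for this illustration:

```r
# Non-missing values of Trait1 from the question
x <- c(1.1, 1.2, 1.35, 1.1, 1.2, 1000, 1.5, 1.4, 1.6)

upper <- mean(x) + 3 * sd(x)            # upper cutoff; 1000 falls below it
z_max <- max(abs(x - mean(x)) / sd(x))  # largest z-score in the sample

# Upper bound on |z| for a sample of size n (sd() divides by n - 1)
z_bound <- function(n) (n - 1) / sqrt(n)

# Smallest sample size for which a z-score above 3 is possible at all
n_min <- min(which(z_bound(1:100) > 3))
```

So a 3-sd rule can only flag something once a column has at least n_min non-missing values, which is why a larger test set is needed.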

Answer 3:

If you need to use a loop, the following should work:

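# data is the data frame created in the question: data = data.frame(test_data)
for (i in colnames(data)) {
  # Only process columns of class "numeric"; integer and character are skipped
  if (class(data[, i]) == "numeric") {
    m <- mean(data[, i], na.rm = TRUE)
    sd <- sd(data[, i], na.rm = TRUE)
    for (j in seq_len(nrow(data))) {
      # Test for NA first: comparing NA with > or < gives NA, which if() rejects
      if (!is.na(data[j, i]) &&
          (data[j, i] > (m + 3 * sd) || data[j, i] < (m - 3 * sd))) {
        data[j, i] <- NA
      }
    }
  }
}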

This is mostly just a slimmed-down version of what you wrote, but there are two key differences. 1) Writing data$i where i is a string holding a column name does not work: data$i looks for a column literally named "i" and returns NULL, so the class check fails for every column and the loop always hits next. Use data[,i] instead. 2) If you don't check that data[j,i] is not NA first, you get an error when if() evaluates something like data[j,i] > (m + 3*sd), because comparing NA returns NA. The other point, which is more stylistic, is that you don't need the else statements. In particular, you can put the for(j in...) loop directly inside the if(class...=="numeric") block and drop else next: the inner loop then only runs for numeric columns, and the outer loop moves on to the next column by itself. Hope that makes sense and is helpful.
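
Since the goal is to reuse this on future data sets, the loop can also be wrapped in a function. The following is only a sketch: na_outliers, its k argument, and the demo data are made up for illustration. It still loops over columns but handles the rows of each column in one vectorised step.

```r
# Hypothetical helper: set values beyond mean +/- k * sd to NA in numeric columns
na_outliers <- function(df, k = 3) {
  for (col in colnames(df)) {
    x <- df[[col]]
    # Skip anything that is not plain "numeric" (integer, character, factor, ...)
    if (!identical(class(x), "numeric")) next
    m <- mean(x, na.rm = TRUE)
    s <- sd(x, na.rm = TRUE)
    # which() drops the NA comparisons produced by values that are already NA
    df[[col]][which(x > m + k * s | x < m - k * s)] <- NA
  }
  df
}

# Made-up data: 19 ordinary values and one extreme value
demo <- data.frame(
  id    = 1:20,
  label = letters[1:20],
  value = c(seq(9, 11, length.out = 19), 1000)
)
cleaned <- na_outliers(demo)
```

Here the integer id column and the character label column are left untouched, and only the extreme entry of value becomes NA.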

Answer 4:

For this kind of thing, I've been using base::ifelse() in conjunction with the tidyverse:

library(tidyverse)
library(magrittr)
library(tidylog)

test_data %<>%

  # Mutate any variable if (and only if) it's numeric...
  mutate_if(is.numeric,

            # ...then, if it meets the following criteria...
            ~ ifelse(
              test = .x > mean(.x, na.rm = TRUE) + 3 * sd(.x, na.rm = TRUE) |
                     .x < mean(.x, na.rm = TRUE) - 3 * sd(.x, na.rm = TRUE) |
                     .x %>% is.na,

              # ...replace with NA. If it doesn't...
              yes = NA,

              # ...leave as is!
              no  = .x

            ))

Note the lambda function above, using ~ and .x.

Echoing what Vitali said above, this code didn't change anything in the dummy data. To make absolutely sure, I loaded in tidylog, which is a neat package that prints dataframe changes due to tidyverse functions whenever they are run.

Edit: thanks to Vitali for pointing out that the original code was not generalizable. I've also removed a lot of the fluff.

Source: https://stackoverflow.com/questions/58070880/looping-through-data-to-set-values-or-variable-as-na-in-r

Tags

loops