Question
I have a data frame containing columns with integers, characters, and numerics. The actual data set is much larger than the example given below, but what is below is a passable and much smaller imitation.
I am trying to loop through the data and change any value greater than the mean + (3 * standard deviation) or less than the mean - (3 * standard deviation) to NA, in the numeric columns only. If a column contains integers or characters, the loop should skip it and continue to the next column. Additionally, most columns already contain some NA values and will have many values that fall within the mean +/- (3 * sd). Those values need to remain as they are.
The ultimate goal of this script is to use it on future data sets with the same structure and while I am open to suggestions with packages, I would like to use loops if possible. However, I am far from an expert in R and will happily take any and all advice anyone has for me!
I have worked out a structure for the overall script, but it stops after the first next statement.
The script:
data = data.frame(test_data)
for (i in colnames(data)){
  if (class(data$i) == "numeric"){
    m = mean(data$i, na.rm=TRUE)
    sd = sd(data$i, na.rm=TRUE)
  }
  else
    next
  for (j in 1:nrow(data)){
    if (data$i[j,] > (m + 3*sd)){
      data$i[j,] <- NA
    }
    else if (data$i[j,] < (m - 3*sd)){
      data$i[j,] <- NA
    }
    else
      next
  }
}
The data being used to test this script is as follows:
Trait1 = c(1.1, 1.2, 1.35, 1.1, 1.2, NA, 1000, 1.5, 1.4, 1.6)
Trait2 = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Trait3 = c(125.1, 119.3, 118.4, NA, 1.1, 122.3, 123.4, 125.7, 121.5, 121.7)
test_data = data.frame(Trait1, Trait2, Trait3)
Thank you in advance for any help you have to offer, I greatly appreciate it!
Answer 1:
Using dplyr and converting the numeric variables to z-scores with scale(), this can be simplified to:
library(dplyr)
test_data %>%
  mutate_if(is.numeric, ~replace(.x, abs(scale(.x)) > 3, NA))
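As a rough illustration of what the scale() call contributes, the base R sketch below (no dplyr needed) checks that abs(scale(x)) > 3 is the same test as comparing against mean +/- 3 * sd. The vector x and the helper name flag_outliers are made up for this example and are not part of the answer.

```r
# Hypothetical example vector: 11 ordinary values, one extreme value, one NA.
x <- c(10, 11, 9, 10, 12, 8, 10, 11, 9, 10, 10, 500, NA)

# scale() centres by the mean and divides by the standard deviation,
# ignoring NA values when it computes both.
z <- as.vector(scale(x))

# The same z-score written out by hand.
z_manual <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Hypothetical helper: set values more than k SDs from the mean to NA.
# which() drops NA positions, so existing NA values are left untouched.
flag_outliers <- function(v, k = 3) {
  replace(v, which(abs(as.vector(scale(v))) > k), NA)
}

cleaned <- flag_outliers(x)
```

With 12 observed values the extreme point can exceed 3 standard deviations, so only the 500 is replaced; the original NA stays NA.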
Answer 2:
Here is a solution without any loop (sorry :)) using the map_df function from the purrr package:
library(purrr)
Trait1 = c(1.1, 1.2, 1.35, 1.1, 1.2, NA, 1000, 1.5, 1.4, 1.6)
Trait2 = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Trait3 = c(125.1, 119.3, 118.4, NA, 1.1, 122.3, 123.4, 125.7, 121.5, 121.7)
test_data = data.frame(Trait1, Trait2, Trait3)
map_df(test_data, function(x) {
  if (class(x) == "numeric") {
    x[x <= (mean(x, na.rm = T) - 3*sd(x, na.rm = T)) | x >= (mean(x, na.rm = T) + 3*sd(x, na.rm = T))] = NA
  }
  return(x)
})
If you want your mean and sd calculation to include NA values, change na.rm = T to na.rm = F. Note that mean() and sd() then return NA for any column that contains an NA.
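A quick way to see the effect of the na.rm argument, using a made-up vector x (not from the question's data):

```r
x <- c(1.1, 1.2, NA, 1.5)  # hypothetical vector with one missing value

mean_keep <- mean(x, na.rm = FALSE)  # NA: the missing value propagates
mean_drop <- mean(x, na.rm = TRUE)   # mean of the three observed values
sd_keep   <- sd(x, na.rm = FALSE)    # NA as well
sd_drop   <- sd(x, na.rm = TRUE)     # SD of the three observed values
```

Because the bounds become NA when na.rm is FALSE, every comparison against them is NA too, so no value in such a column would be replaced.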
NB: Pay attention to the fact that in this case no value is greater than the mean plus three standard deviations or smaller than the mean minus three standard deviations. If you were thinking that 1000 in column Trait1 was your "suspicious" point, think again: it is not greater than mean + 3*sd. I recommend testing on a different dataset.
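The reason nothing is flagged is general rather than specific to these numbers: with the sample standard deviation, no point in a sample of n values can lie more than (n - 1) / sqrt(n) standard deviations from the mean, which is about 2.67 for the 9 non-missing values of Trait1. The sketch below checks this, then uses a hypothetical longer vector (longer, made up for illustration) in which the same 1000 does exceed the cutoff.

```r
Trait1 <- c(1.1, 1.2, 1.35, 1.1, 1.2, NA, 1000, 1.5, 1.4, 1.6)

obs <- Trait1[!is.na(Trait1)]
n <- length(obs)                         # 9 observed values
max_possible_z <- (n - 1) / sqrt(n)      # upper bound on any |z| for this n
z_1000 <- (1000 - mean(obs)) / sd(obs)   # z-score of the suspicious value
upper <- mean(obs) + 3 * sd(obs)         # the 3-SD cutoff for Trait1

# Hypothetical longer sample: 19 ordinary points plus the same extreme value.
longer <- c(rep(c(1.1, 1.2, 1.35, 1.5, 1.4, 1.6), length.out = 19), 1000)
z_longer <- (1000 - mean(longer)) / sd(longer)
```

A single extreme value inflates the standard deviation it is measured against, so a 3-SD rule only becomes able to flag anything once a column has at least 11 non-missing values.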
Answer 3:
If you need to use a loop, the following should work:
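for (i in colnames(data)) {
  if (class(data[, i]) == "numeric") {
    # Column mean and SD, ignoring existing NA values
    m = mean(data[, i], na.rm = TRUE)
    sd = sd(data[, i], na.rm = TRUE)
    for (j in 1:nrow(data)) {
      # Skip existing NA values, then blank anything outside mean +/- 3 SD
      if (!is.na(data[j, i]) && (data[j, i] > (m + 3*sd) || data[j, i] < (m - 3*sd))) {
        data[j, i] <- NA
      }
    }
  }
}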
This is mostly just a slimmed-down version of what you wrote, but the key differences are that 1) writing data$i where i is a string holding a column name does not work, because $ looks for a column literally named i and returns NULL, and 2) if you don't check that data[j,i] is not NA, you will get an error when an if statement evaluates a condition like data[j,i] > (m + 3*sd). The other point, which is more stylistic, is that you don't strictly need all the else statements. In particular, you can put the for(j in...) loop directly inside the if(class...=="numeric") clause, without else next, because else next only skips the rest when the class is not "numeric", and inside the clause you already know that it is. Hope that makes sense and is helpful.
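The two points above can be checked on a small example. The data frame df below is a made-up three-row stand-in for the question's data, and the variable names are chosen only for this sketch.

```r
# Hypothetical miniature of the question's data.
df <- data.frame(Trait1 = c(1.1, 1.2, 1000), Trait2 = c("A", "B", "C"))
i <- "Trait1"

by_dollar  <- df$i     # looks for a column literally named "i": NULL
by_bracket <- df[[i]]  # uses the value of i: the Trait1 column

# class(NULL) is "NULL", so the question's numeric check is FALSE for
# every column and the loop always takes the `else next` branch.
is_num_dollar  <- class(df$i) == "numeric"
is_num_bracket <- class(df[[i]]) == "numeric"

# Comparing NA yields NA, and if() cannot branch on NA,
# which is why the loop needs the is.na() guard.
na_comparison <- NA_real_ > 5
```

Both df[[i]] and df[, i] work when i holds a column name; df[[i]] always returns the column as a vector.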
Answer 4:
For this kind of thing, I've been using base::ifelse() in conjunction with the tidyverse:
library(tidyverse)
library(magrittr)
library(tidylog)
test_data %<>%
  # Mutate any variable if (and only if) it's numeric...
  mutate_if(is.numeric,
            # ...then, if it meets the following criteria...
            ~ ifelse(
              test = .x > mean(.x, na.rm = TRUE) + 3 * sd(.x, na.rm = TRUE) |
                .x < mean(.x, na.rm = TRUE) - 3 * sd(.x, na.rm = TRUE) |
                .x %>% is.na,
              # ...replace with NA. If it doesn't...
              yes = NA,
              # ...leave as is!
              no = .x
            ))
Note the lambda function above, using ~ and .x.
Echoing what Vitali said above, this code didn't change anything in the dummy data. To make absolutely sure, I loaded tidylog, which is a neat package that prints dataframe changes due to tidyverse functions whenever they are run.
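A small base R sketch of how ifelse() treats missing values in a test like the one above. The vector x and the fixed cutoff are made up for illustration; the cutoff stands in for mean + 3 * sd.

```r
x <- c(5, NA, 250)  # hypothetical values
cutoff <- 100       # hypothetical threshold, standing in for mean + 3 * sd

# With an explicit is.na() clause, as in the answer's code.
with_na_clause <- ifelse(x > cutoff | is.na(x), NA, x)

# Without it: an NA in the test already gives NA in the result.
without_na_clause <- ifelse(x > cutoff, NA, x)
```

Both versions give the same result here, which suggests the is.na clause is optional: existing NA values stay NA either way.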
Edit: thanks to Vitali for pointing out that the original code was not generalizable. I've also removed a lot of the fluff.
Source: https://stackoverflow.com/questions/58070880/looping-through-data-to-set-values-or-variable-as-na-in-r