Question
My current data frame consists of numerical values. I am identifying outliers in my data frame column by column. Can I identify the outliers in all columns at once and remove them in one go? Right now I am changing the values to NA.
My Code:
quantiles<-tapply(var1,names,quantile)
minq <- sapply(names, function(x) quantiles[[x]]["25%"])
maxq <- sapply(names, function(x) quantiles[[x]]["75%"])
var1[var1<minq | var1>maxq] <- NA
Data.
Data posted by the OP in a comment, in dput format.
df1 <-
structure(list(Var1 = c(100.2, 110, 200, 456, 120000),
var2 = c(NA, 4545L, 45465L, 44422L, 250000L),
var3 = c(NA, 210000L, 91500L, 215000L, 250000L),
var4 = c(0.983, 0.44, 0.983, 0.78, 2.23)),
class = "data.frame", row.names = c(NA, -5L))
Answer 1:
The following removes the outliers from the dataframe, but the result is a list, not a dataframe, since the resulting vectors are not all of the same length.
df2 <- lapply(df1, function(x){
  qq <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  x[!is.na(x) & qq[1] <= x & x <= qq[2]]
})
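To see why the result cannot simply be turned back into a data frame, it may help to count how many values survive in each column. A minimal sketch, re-creating `df1` from the Data section so that it runs on its own (the name `kept` is only used here for illustration):

```r
# df1 as given in the Data section
df1 <- data.frame(
  Var1 = c(100.2, 110, 200, 456, 120000),
  var2 = c(NA, 4545L, 45465L, 44422L, 250000L),
  var3 = c(NA, 210000L, 91500L, 215000L, 250000L),
  var4 = c(0.983, 0.44, 0.983, 0.78, 2.23)
)

# keep only the values between the 1st and 3rd quartile of each column
df2 <- lapply(df1, function(x){
  qq <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  x[!is.na(x) & qq[1] <= x & x <= qq[2]]
})

# number of values kept per column (illustrative name)
kept <- lengths(df2)
kept
# Var1 var2 var3 var4
#    3    2    2    3
```

Since the columns end up with 3, 2, 2 and 3 values, they no longer line up row by row, which is why a list is returned.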
Edit
Following a second question by the same user, @user11368874, the code below is adapted from the first snippet above and answers that second question by replacing the outliers with NA instead of dropping them.
df3 <- df1
df3[] <- lapply(df1, function(x){
  qq <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  is.na(x) <- x < qq[1] | x > qq[2]
  x
})
df3
# Var1 var2 var3 var4
#1 NA NA NA 0.983
#2 110 NA 210000 NA
#3 200 45465 NA 0.983
#4 456 44422 215000 0.780
#5 NA NA NA NA
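If the goal is to drop the affected rows rather than keep the NA markers, one option is to pass the NA-marked result to `na.omit()`. This is only a sketch, with one caveat: `na.omit()` cannot tell a flagged outlier from an NA that was already in the data, so rows with pre-existing NA values (row 1 here) are dropped as well. The name `df4` is chosen here for illustration.

```r
# df1 as given in the Data section
df1 <- data.frame(
  Var1 = c(100.2, 110, 200, 456, 120000),
  var2 = c(NA, 4545L, 45465L, 44422L, 250000L),
  var3 = c(NA, 210000L, 91500L, 215000L, 250000L),
  var4 = c(0.983, 0.44, 0.983, 0.78, 2.23)
)

# mark values outside the 1st/3rd quartile as NA, column by column
df3 <- df1
df3[] <- lapply(df1, function(x){
  qq <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  is.na(x) <- x < qq[1] | x > qq[2]
  x
})

# drop every row that contains at least one NA (illustrative name)
df4 <- na.omit(df3)
df4
#   Var1  var2   var3 var4
# 4  456 44422 215000 0.78
```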
Answer 2:
The following function tests which values in each column lie below the 1st quartile or above the 3rd quartile (by default; the cut-offs are set through q). Note that this is stricter than Tukey's fences in the usual sense, which lie 1.5 × IQR beyond the quartiles. Then, depending on the user's preference, the function either removes all rows that contain at least one outlier or replaces the outliers with NA.
outlier.out <- function(dat, q = c(0.25, 0.75), out = TRUE){
  # logical matrix marking which cells are outliers
  tests <- matrix(NA, ncol = ncol(dat), nrow = nrow(dat))
  # test which cells contain outliers; existing NA values count as non-outliers
  for(i in seq_len(ncol(dat))){
    qq <- quantile(dat[, i], q, na.rm = TRUE)
    tests[, i] <- sapply(dat[, i] < qq[1] | dat[, i] > qq[2], isTRUE)
  }
  if(out){
    # remove rows that contain at least one outlier
    dat <- dat[!apply(tests, 1, FUN = any, na.rm = TRUE) ,]
  } else {
    # replace outliers with NA
    dat[tests] <- NA
  }
  return(dat)
}
outlier.out(df1)
# Var1 var2 var3 var4
# 4 456 44422 215000 0.78
outlier.out(df1, out = FALSE)
# Var1 var2 var3 var4
# 1 NA NA NA 0.983
# 2 110 NA 210000 NA
# 3 200 45465 NA 0.983
# 4 456 44422 215000 0.780
# 5 NA NA NA NA
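For comparison, Tukey's fences in the usual sense sit at Q1 - 1.5 × IQR and Q3 + 1.5 × IQR rather than at the quartiles themselves. The sketch below is a hypothetical variant, not part of either answer: the helper name `tukey_na` and its parameter `k` are assumptions made for illustration. It replaces values outside those wider fences with NA.

```r
# df1 as given in the Data section
df1 <- data.frame(
  Var1 = c(100.2, 110, 200, 456, 120000),
  var2 = c(NA, 4545L, 45465L, 44422L, 250000L),
  var3 = c(NA, 210000L, 91500L, 215000L, 250000L),
  var4 = c(0.983, 0.44, 0.983, 0.78, 2.23)
)

# hypothetical helper: set values outside [Q1 - k*IQR, Q3 + k*IQR] to NA
tukey_na <- function(dat, k = 1.5){
  dat[] <- lapply(dat, function(x){
    qq  <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
    iqr <- qq[2] - qq[1]
    # existing NA values are left alone and never counted as outliers
    out <- !is.na(x) & (x < qq[1] - k * iqr | x > qq[2] + k * iqr)
    x[out] <- NA
    x
  })
  dat
}

res <- tukey_na(df1)
# flagged with k = 1.5: Var1 120000; var2 250000; var3 91500; var4 0.44 and 2.23
```

With the wider fences, fewer values are flagged than with the quartile cut-offs: for example, 100.2 in Var1 and 250000 in var3 are kept.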
Source: https://stackoverflow.com/questions/56629367/identify-outliers-in-a-dataframe-in-r