Question
I would like to replace outliers in each column of a dataframe with NA.
If, for example, we define an outlier as any value greater than 3 standard deviations from the mean, I can achieve this for a single variable with the code below.
Rather than specifying each column individually, I'd like to perform the same operation on all columns of df in one call. Any pointers on how to do this?
Thanks!
library(dplyr)
data("iris")
df <- iris %>%
  select(Sepal.Length, Sepal.Width, Petal.Length) %>%
  head(10)
# add a clear outlier to each variable
df[1, 1:3] = 99
# replace values above 3 SDs with NA
df_cleaned <- df %>%
  mutate(Sepal.Length = replace(Sepal.Length, Sepal.Length > (abs(3 * sd(df$Sepal.Length, na.rm = TRUE))), NA))
Answer 1:
You need to use mutate_all(), i.e.
library(dplyr)
df %>%
  mutate_all(funs(replace(., . > (abs(3 * sd(., na.rm = TRUE))), NA)))
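In recent dplyr releases funs() is deprecated and mutate_all() is superseded by across(). A minimal sketch of the same call in that style, assuming dplyr 1.0.0 or later and the df built in the question (it keeps the question's threshold, i.e. raw values compared against 3 * sd):

```r
library(dplyr)

# Rebuild the example data from the question
df <- iris %>%
  select(Sepal.Length, Sepal.Width, Petal.Length) %>%
  head(10)
df[1, 1:3] <- 99

# Apply the same replacement to every column in one call
df_cleaned <- df %>%
  mutate(across(everything(),
                ~ replace(.x, .x > abs(3 * sd(.x, na.rm = TRUE)), NA)))

df_cleaned
```

To restrict this to numeric columns in a mixed data frame, everything() could be swapped for where(is.numeric).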
Answer 2:
Another option is base R:
df[] <- lapply(df, function(x) replace(x, x > (abs(3 * sd(x, na.rm = TRUE))), NA))
or with colSds from matrixStats:
library(matrixStats)
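# sweep() compares each column against its own threshold; a bare df > vector
# comparison would recycle the thresholds down the rows instead
df[sweep(as.matrix(df), 2, abs(3 * colSds(as.matrix(df), na.rm = TRUE)), ">")] <- NA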
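Note that the snippets in this thread compare raw values against 3 * sd, whereas the question's wording describes distance from the mean. A rough sketch of the mean-centred variant follows, in case that is what is wanted. The helper na_if_outlier and its argument k are hypothetical names, not part of the original answers. It uses all 150 rows of iris, because with only 10 rows a single point can be at most (n - 1) / sqrt(n), roughly 2.85, sample SDs from the mean, so nothing would ever be flagged.

```r
# Hypothetical helper (not from the original answers): sets to NA any value
# whose distance from the column mean exceeds k standard deviations.
na_if_outlier <- function(x, k = 3) {
  z <- abs(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  replace(x, !is.na(z) & z > k, NA)
}

# Full-length example data with one injected outlier per column
df_all <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length")]
df_all[1, ] <- 99

# Apply the helper to every column, keeping the data frame structure
df_all[] <- lapply(df_all, na_if_outlier)

head(df_all)
```

Assigning to df_all[] rather than df_all keeps the result a data frame instead of a plain list.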
Source: https://stackoverflow.com/questions/55745379/how-to-conditionally-replace-values-with-na-across-multiple-columns