Question
I'd like to replace all values in my relatively large R dataset that lie above the 95th percentile or below the 5th percentile with those percentile values, respectively. My aim is to avoid simply cropping these outliers from the data entirely.
Any advice would be much appreciated; I can't find any information on how to do this anywhere else.
Answer 1:
This would do it.
fun <- function(x){
  # 5th and 95th percentiles of x
  quantiles <- quantile( x, c(.05, .95 ) )
  # cap everything below the 5th / above the 95th percentile
  x[ x < quantiles[1] ] <- quantiles[1]
  x[ x > quantiles[2] ] <- quantiles[2]
  x
}
fun( yourdata )
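One caveat: quantile() stops with an error if x contains NA values unless na.rm = TRUE is set. Below is a minimal sketch of an NA-tolerant variant, assuming you want missing values left as NA. The name cap_percentiles and the probs argument are made up for illustration and are not part of the answer above.

```r
# Hypothetical helper (not from the original answer): cap values at the
# given lower/upper percentiles, leaving NA entries as NA.
cap_percentiles <- function(x, probs = c(0.05, 0.95)) {
  # names = FALSE returns plain numbers without the "5%"/"95%" labels
  q <- quantile(x, probs, na.rm = TRUE, names = FALSE)
  # pmax raises values below the lower cap; pmin lowers values above the upper cap
  pmin(pmax(x, q[1]), q[2])
}

# Made-up example data with one missing value
x <- c(1:20, NA)
capped <- cap_percentiles(x)
range(capped, na.rm = TRUE)
```

Using pmin()/pmax() instead of two subset assignments gives the same capping in one expression; which form to use is mostly a matter of taste.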
Answer 2:
You can do it in one line of code using squish():
d2 <- squish(d, quantile(d, c(.05, .95)))
squish() is in the scales library; see ?squish and ?discard
#--------------------------------
library(scales)
pr <- .95
q <- quantile(d, c(1-pr, pr))
d2 <- squish(d, q)
#---------------------------------
# Note: depending on your needs, you may want to round off the quantiles, i.e.:
q <- round(quantile(d, c(1-pr, pr)))
Example:
d <- 1:20
d
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
d2 <- squish(d, round(quantile(d, c(.05, .95))))
d2
# [1] 2 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 19
Answer 3:
I used this code to get what you need:
qn <- quantile(df$value, c(0.05, 0.95), na.rm = TRUE)
df <- within(df, {
  value <- ifelse(value < qn[1], qn[1], value)
  value <- ifelse(value > qn[2], qn[2], value)
})
where df is your data.frame and value is the column that contains your data.
Answer 4:
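There is a better way to solve this problem. A point is not an outlier simply because it lies above the 95th percentile or below the 5th percentile. By a common convention (Tukey's rule), a point is considered an outlier if it is below the first quartile - 1.5·IQR or above the third quartile + 1.5·IQR. The function below uses that rule to decide which points to change and then replaces them with the 5th or 95th percentile value.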
This website explains it more thoroughly.
To learn more about outlier treatment, refer here.
capOutlier <- function(x){
  # Tukey fences: quartiles plus or minus 1.5 * IQR
  qnt <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
  H <- 1.5 * IQR(x, na.rm = TRUE)
  # Replacement values: the 5th and 95th percentiles
  caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE)
  x[x < (qnt[1] - H)] <- caps[1]
  x[x > (qnt[2] + H)] <- caps[2]
  return(x)
}
df$colName <- capOutlier(df$colName)
Repeat the line above for each column in your data frame that you want to cap.
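If there are many columns, repeating that line by hand gets tedious. Here is a rough sketch of applying capOutlier to every numeric column in one go with lapply(). The function is repeated so the snippet runs on its own, and the example data frame (columns a, b and id) is made up purely for illustration.

```r
# capOutlier as in the answer above, repeated so this snippet is self-contained
capOutlier <- function(x) {
  qnt  <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
  caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE)
  H    <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < (qnt[1] - H)] <- caps[1]
  x[x > (qnt[2] + H)] <- caps[2]
  x
}

# Made-up example data: two numeric columns (each with one extreme value)
# and one character column that should be left alone
df <- data.frame(
  a  = c(1:19, 100),
  b  = c(-50, 2:20),
  id = letters[1:20],
  stringsAsFactors = FALSE
)

# Flag the numeric columns, then cap each of them in place
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], capOutlier)

summary(df[num_cols])
```

Restricting the call to numeric columns avoids errors from quantile() on character or factor columns; whether every numeric column should really be capped depends on your data.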
Source: https://stackoverflow.com/questions/13339685/how-to-replace-outliers-with-the-5th-and-95th-percentile-values-in-r