How to replace outliers with the 5th and 95th percentile values in R

风流意气都作罢 posted on 2019-12-12 08:12:15

Question


I'd like to replace all values in my relatively large R dataset that lie above the 95th percentile or below the 5th percentile with the respective percentile values. My aim is to avoid simply cropping these outliers from the data entirely.

Any advice would be much appreciated; I can't find any information on how to do this anywhere else.


Answer 1:


This would do it.

fun <- function(x){
    quantiles <- quantile( x, c(.05, .95 ) )
    x[ x < quantiles[1] ] <- quantiles[1]
    x[ x > quantiles[2] ] <- quantiles[2]
    x
}
fun( yourdata )
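
One caveat with the function above: quantile() raises an error when x contains NA unless na.rm = TRUE is set. Below is a minimal sketch of a variant that tolerates missing values and takes the cut-offs as arguments. The name cap_at_quantiles and its lower/upper parameters are illustrative and not part of the original answer.

```r
# Hypothetical helper (not from the original answer): cap values at two quantiles
cap_at_quantiles <- function(x, lower = 0.05, upper = 0.95) {
  stopifnot(is.numeric(x), lower < upper)
  # na.rm = TRUE makes quantile() ignore missing values instead of failing
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  # which() drops NA comparisons, so missing values are left untouched
  x[which(x < q[1])] <- q[1]
  x[which(x > q[2])] <- q[2]
  x
}

x <- c(1:20, NA)
capped <- cap_at_quantiles(x)
range(capped, na.rm = TRUE)  # the smallest and largest values are now 1.95 and 19.05
```

The NA in the last position is preserved, so the result can be written back into the original column without changing its length.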



Answer 2:


You can do it in one line of code using squish() from the scales package:

d2 <- squish(d, quantile(d, c(.05, .95)))



In the scales package, see ?squish and ?discard

#--------------------------------
library(scales)

pr <- .95
q  <- quantile(d, c(1-pr, pr))
d2 <- squish(d, q)
#---------------------------------

# Note: depending on your needs, you may want to round off the quantiles, i.e.:
q <- round(quantile(d, c(1-pr, pr)))

Example:

d <- 1:20
d
# [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20


d2 <- squish(d, round(quantile(d, c(.05, .95))))
d2
# [1]  2  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 19
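
If adding the scales dependency is not an option, the same clamping can be sketched in base R with pmin() and pmax(). This is an assumed equivalent for finite numeric input, not code from the original answer:

```r
d <- 1:20
# Rounded 5th and 95th percentiles, as in the example above
q <- round(quantile(d, c(0.05, 0.95), names = FALSE))
# pmax() raises values below the lower limit; pmin() lowers values above the upper limit
d2 <- pmin(pmax(d, q[1]), q[2])
d2
```

For this input the result matches the squish() output shown above: the 1 becomes 2 and the 20 becomes 19.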



Answer 3:


I used this code to get what you need:

qn = quantile(df$value, c(0.05, 0.95), na.rm = TRUE)
df = within(df, { value = ifelse(value < qn[1], qn[1], value)
                  value = ifelse(value > qn[2], qn[2], value)})

where df is your data.frame, and value the column that contains your data.
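
A small worked example of the same approach on made-up data may help. The id column and the numbers are purely illustrative. Note that quantile() interpolates between data points by default (type = 7), so the replacement values need not be values that occur in the data:

```r
# Toy data frame: 18 ordinary values plus one extreme value at each end
df <- data.frame(id = 1:20, value = c(-50, 2:19, 500))

qn <- quantile(df$value, c(0.05, 0.95), na.rm = TRUE)
df <- within(df, {
  value <- ifelse(value < qn[1], qn[1], value)  # raise values below the 5th percentile
  value <- ifelse(value > qn[2], qn[2], value)  # lower values above the 95th percentile
})

df$value[c(1, 20)]  # the two extremes are now -0.6 and 43.05
```

Only the value column changes; id and the 18 values inside the percentile range are left as they were.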




Answer 4:


There is a better way to solve this problem. Under the common 1.5·IQR convention, an outlier is not simply any point above the 95th percentile or below the 5th percentile. Instead, a point is considered an outlier if it lies below the first quartile - 1.5·IQR or above the third quartile + 1.5·IQR. The function below replaces only those points, using the 5th and 95th percentile values as the caps.
This website explains it more thoroughly.

To learn more about outlier treatment, refer here.

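capOutlier <- function(x){
   # Quartiles define the 1.5*IQR fences; the 5th/95th percentiles are the replacement values
   qnt <- quantile(x, probs=c(.25, .75), na.rm = TRUE)
   caps <- quantile(x, probs=c(.05, .95), na.rm = TRUE)
   H <- 1.5 * IQR(x, na.rm = TRUE)
   # Only points outside the fences are replaced
   x[x < (qnt[1] - H)] <- caps[1]
   x[x > (qnt[2] + H)] <- caps[2]
   return(x)
}
df$colName <- capOutlier(df$colName)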
Repeat the last line for each column of your data frame that needs capping.
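
Rather than repeating that assignment by hand, the function can be applied to every numeric column in one step. The following is a sketch on a made-up data frame. The column names a, b and label are hypothetical, and capOutlier() is reproduced from the answer so the block runs on its own:

```r
# capOutlier() as defined in the answer above
capOutlier <- function(x) {
  qnt  <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
  caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE)
  H    <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < (qnt[1] - H)] <- caps[1]
  x[x > (qnt[2] + H)] <- caps[2]
  x
}

# Toy data: one high outlier in a, one low outlier in b, plus a text column
df <- data.frame(a = c(1:19, 100), b = c(-100, 2:20), label = letters[1:20])

# Select the numeric columns and cap each of them; other columns are untouched
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], capOutlier)

df$a[20]  # 100 is replaced by the 95th percentile of a, 23.05
df$b[1]   # -100 is replaced by the 5th percentile of b, -3.1
```

Selecting columns with is.numeric avoids errors from passing character or factor columns to quantile().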


Source: https://stackoverflow.com/questions/13339685/how-to-replace-outliers-with-the-5th-and-95th-percentile-values-in-r
