How to use Outlier Tests in R Code

难免孤独 2020-12-04 18:52

As part of my data analysis workflow, I want to test for outliers, and then do my further calculation with and without those outliers.

I've found the outlier pack

4 Answers
  • 2020-12-04 19:17

    "It's hard". Much of this is context-dependent and you may have to embed this into your application:

    • Does the data drift, trend, or cycle?
    • Is the variability fixed, or is it itself variable?
    • Are there other series you can use for 'benchmarking'?

    Besides the outliers package, there is also the qcc package, since the quality control literature covers this area.

    There are many other areas you could look at, e.g. the robust statistics Task View.
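
    As a small illustration of why the robust statistics literature is relevant here, this is a sketch in base R with made-up numbers (the vector `x` is purely illustrative and not taken from the question). It compares the mean and standard deviation with the median and MAD when one value is wild:

    ```r
    # Made-up sample: six ordinary readings and one wild value
    x <- c(10, 11, 9, 10, 12, 10, 95)

    classical <- c(center = mean(x),   spread = sd(x))
    robust    <- c(center = median(x), spread = mad(x))  # mad() rescales by 1.4826 by default

    print(rbind(classical, robust))
    ```

    The classical estimates are pulled towards the wild value, while the median and MAD barely react to it.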

  • 2020-12-04 19:18

    If you're worried about outliers, instead of throwing them out, use a robust method. For example, instead of lm, use rlm (from the MASS package).
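
    A minimal sketch of that swap, assuming simulated data with one planted outlier (the data frame `dat`, the seed, and the model formula are illustrative assumptions, not part of the question):

    ```r
    library(MASS)  # provides rlm()

    set.seed(42)   # arbitrary seed so the sketch is reproducible
    dat <- data.frame(X1 = rnorm(50, mean = 50, sd = 10),
                      X2 = rnorm(50, mean = 5,  sd = 1.5),
                      Y  = rnorm(50, mean = 200, sd = 25))
    # Plant one suspicious observation in row 10
    dat$X1[10] <- 5
    dat$Y[10]  <- 530

    ols_fit    <- lm(Y ~ X1 + X2, data = dat)
    robust_fit <- rlm(Y ~ X1 + X2, data = dat)  # Huber M-estimation by default

    # Coefficients side by side
    print(cbind(lm = coef(ols_fit), rlm = coef(robust_fit)))

    # rlm keeps every row but downweights rows with large residuals
    print(round(robust_fit$w[8:12], 3))
    ```

    No rows are dropped; the final weights in `robust_fit$w` show how much each observation was allowed to influence the fit, which also gives a quick list of candidates to inspect.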

  • 2020-12-04 19:19

    I agree with Dirk: it's hard. I would recommend first looking at why you might have outliers. An outlier is just a number that someone thinks is suspicious; it's not a concrete 'bad' value, and unless you can find a reason for it to be an outlier, you may have to live with the uncertainty.

    One thing you didn't mention is what kind of outlier you're looking at. Are your data clustered around a mean? Do they have a particular distribution? Is there some relationship within your data?

    Here are some examples.

    First, we'll create some data, and then taint it with an outlier:

    > testout<-data.frame(X1=rnorm(50,mean=50,sd=10),X2=rnorm(50,mean=5,sd=1.5),Y=rnorm(50,mean=200,sd=25))
    > #Taint the Data
    > testout$X1[10]<-5
    > testout$X2[10]<-5
    > testout$Y[10]<-530
    
    > testout
             X1         X2        Y
    1  44.20043  1.5259458 169.3296
    2  40.46721  5.8437076 200.9038
    3  48.20571  3.8243373 189.4652
    4  60.09808  4.6609190 177.5159
    5  50.23627  2.6193455 210.4360
    6  43.50972  5.8212863 203.8361
    7  44.95626  7.8368405 236.5821
    8  66.14391  3.6828843 171.9624
    9  45.53040  4.8311616 187.0553
    10  5.00000  5.0000000 530.0000
    11 64.71719  6.4007245 164.8052
    12 54.43665  7.8695891 192.8824
    13 45.78278  4.9921489 182.2957
    14 49.59998  4.7716099 146.3090
    <snip>
    48 26.55487  5.8082497 189.7901
    49 45.28317  5.0219647 208.1318
    50 44.84145  3.6252663 251.5620
    

    It's often most useful to examine the data graphically (your brain is much better at spotting outliers than maths is).

    > #Use Boxplot to Review the Data
    > boxplot(testout$X1, ylab="X1")
    > boxplot(testout$X2, ylab="X2")
    > boxplot(testout$Y, ylab="Y")
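
    If you also want the flagged points as numbers rather than only as dots on a plot, base R's boxplot.stats() returns them. A small sketch, using the first 14 X1 values from the printout above as a stand-in for the full column:

    ```r
    # First 14 X1 values from the printout; row 10 is the tainted one
    x1 <- c(44.20043, 40.46721, 48.20571, 60.09808, 50.23627, 43.50972, 44.95626,
            66.14391, 45.53040,  5.00000, 64.71719, 54.43665, 45.78278, 49.59998)

    # Values lying beyond the whiskers (1.5 times the box length by default)
    flagged <- boxplot.stats(x1)$out
    print(flagged)

    # Row numbers of the flagged values
    print(which(x1 %in% flagged))
    ```

    This is the same rule the boxplot uses to draw individual points, so it is a screening aid rather than a formal test.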
    

    Then you can use a test. If the test returns a cutoff value, or the value that might be an outlier, you can use ifelse to remove it. Here outlier() comes from the outliers package, which has to be loaded first:

    > #Use Outlier test to remove individual values
    > library(outliers)
    > testout$newX1<-ifelse(testout$X1==outlier(testout$X1),NA,testout$X1)
    > testout
             X1         X2        Y    newX1
    1  44.20043  1.5259458 169.3296 44.20043
    2  40.46721  5.8437076 200.9038 40.46721
    3  48.20571  3.8243373 189.4652 48.20571
    4  60.09808  4.6609190 177.5159 60.09808
    5  50.23627  2.6193455 210.4360 50.23627
    6  43.50972  5.8212863 203.8361 43.50972
    7  44.95626  7.8368405 236.5821 44.95626 
    8  66.14391  3.6828843 171.9624 66.14391 
    9  45.53040  4.8311616 187.0553 45.53040
    10  5.00000  5.0000000 530.0000       NA 
    11 64.71719  6.4007245 164.8052 64.71719 
    12 54.43665  7.8695891 192.8824 54.43665 
    13 45.78278  4.9921489 182.2957 45.78278 
    14 49.59998  4.7716099 146.3090 49.59998 
    15 45.07720  4.2355525 192.9041 45.07720 
    16 62.27717  7.1518606 186.6482 62.27717 
    17 48.50446  3.0712422 228.3253 48.50446 
    18 65.49983  5.4609713 184.8983 65.49983 
    19 44.38387  4.9305222 213.9378 44.38387 
    20 43.52883  8.3777627 203.5657 43.52883 
    <snip>
    49 45.28317  5.0219647 208.1318 45.28317 
    50 44.84145  3.6252663 251.5620 44.84145
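
    Note that outlier() simply returns the value furthest from the mean, whether or not it is unusual. One way to attach a significance test to it is grubbs.test() from the same package. The sketch below is only an illustration: it uses the first 14 X1 values from the printout, and the 0.05 level is an assumed choice.

    ```r
    library(outliers)  # provides grubbs.test()

    # First 14 X1 values from the printout; row 10 is the tainted one
    x1 <- c(44.20043, 40.46721, 48.20571, 60.09808, 50.23627, 43.50972, 44.95626,
            66.14391, 45.53040,  5.00000, 64.71719, 54.43665, 45.78278, 49.59998)

    gt <- grubbs.test(x1)   # tests the single most extreme value
    print(gt)

    alpha <- 0.05           # assumed significance level
    x1_clean <- x1
    if (gt$p.value < alpha) {
      # Blank out the value furthest from the mean, but only if the test rejects
      x1_clean[which.max(abs(x1 - mean(x1)))] <- NA
    }
    print(x1_clean)
    ```

    Keeping both `x1` and `x1_clean` makes it easy to run the later calculations with and without the suspect value.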
    

    Or, for more complicated examples, you can use statistics to calculate critical cutoff values, here using the Lund test (see Lund, R. E. 1975, "Tables for An Approximate Test for Outliers in Linear Models", Technometrics, vol. 17, no. 4, pp. 473-476, and Prescott, P. 1975, "An Approximate Test for Outliers in Linear Models", Technometrics, vol. 17, no. 1, pp. 129-132).
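
    Written out, the critical value that the lundcrit function in this answer computes is the following, where the symbol F* is introduced here only for readability and denotes the 1 - α/n quantile of the F distribution with 1 and n - q - 1 degrees of freedom:

    ```latex
    c(\alpha, n, q) = \sqrt{\frac{(n - q)\, F^{*}}{n - q - 1 + F^{*}}},
    \qquad
    F^{*} = F_{1 - \alpha/n;\; 1,\; n - q - 1}
    ```

    An observation is flagged when the absolute value of its standardized residual exceeds c(α, n, q), with n the number of observations and q the number of fitted coefficients including the intercept.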

    > #Alternative approach using Lund Test
    > lundcrit<-function(a, n, q) {
    + # Calculates a Critical value for Outlier Test according to Lund
    + # See Lund, R. E. 1975, "Tables for An Approximate Test for Outliers in Linear Models", Technometrics, vol. 17, no. 4, pp. 473-476.
    + # and Prescott, P. 1975, "An Approximate Test for Outliers in Linear Models", Technometrics, vol. 17, no. 1, pp. 129-132.
    + # a = alpha
    + # n = Number of data elements
    + # q = Number of independent Variables (including intercept)
    + Fcrit<-qf(c(1-(a/n)),df1=1,df2=n-q-1,lower.tail=TRUE)
    + crit<-((n-q)*Fcrit/(n-q-1+Fcrit))^0.5
    + crit
    + }
    
    > testoutlm<-lm(Y~X1+X2,data=testout)
    
    > testout$fitted<-fitted(testoutlm)
    
    > testout$residual<-residuals(testoutlm)
    
    > testout$standardresid<-rstandard(testoutlm)
    
    > n<-nrow(testout)
    
    > q<-length(testoutlm$coefficients)
    
    > crit<-lundcrit(0.1,n,q)
    
    > testout$Ynew<-ifelse(abs(testout$standardresid)>crit,NA,testout$Y)
    
    > testout
             X1         X2        Y    newX1   fitted    residual standardresid
    1  44.20043  1.5259458 169.3296 44.20043 209.8467 -40.5171222  -1.009507695
    2  40.46721  5.8437076 200.9038 40.46721 231.9221 -31.0183107  -0.747624895
    3  48.20571  3.8243373 189.4652 48.20571 203.4786 -14.0134646  -0.335955648
    4  60.09808  4.6609190 177.5159 60.09808 169.6108   7.9050960   0.190908291
    5  50.23627  2.6193455 210.4360 50.23627 194.3285  16.1075799   0.391537883
    6  43.50972  5.8212863 203.8361 43.50972 222.6667 -18.8306252  -0.452070155
    7  44.95626  7.8368405 236.5821 44.95626 223.3287  13.2534226   0.326339981
    8  66.14391  3.6828843 171.9624 66.14391 148.8870  23.0754677   0.568829360
    9  45.53040  4.8311616 187.0553 45.53040 214.0832 -27.0279262  -0.646090667
    10  5.00000  5.0000000 530.0000       NA 337.0535 192.9465135   5.714275585
    11 64.71719  6.4007245 164.8052 64.71719 159.9911   4.8141018   0.118618011
    12 54.43665  7.8695891 192.8824 54.43665 194.7454  -1.8630426  -0.046004311
    13 45.78278  4.9921489 182.2957 45.78278 213.7223 -31.4266180  -0.751115595
    14 49.59998  4.7716099 146.3090 49.59998 201.6296 -55.3205552  -1.321042392
    15 45.07720  4.2355525 192.9041 45.07720 213.9655 -21.0613819  -0.504406009
    16 62.27717  7.1518606 186.6482 62.27717 169.2455  17.4027250   0.430262983
    17 48.50446  3.0712422 228.3253 48.50446 200.6938  27.6314695   0.667366651
    18 65.49983  5.4609713 184.8983 65.49983 155.2768  29.6214506   0.726319931
    19 44.38387  4.9305222 213.9378 44.38387 217.7981  -3.8603382  -0.092354925
    20 43.52883  8.3777627 203.5657 43.52883 228.9961 -25.4303732  -0.634725264
    <snip>
    49 45.28317  5.0219647 208.1318 45.28317 215.3075  -7.1756966  -0.171560291
    50 44.84145  3.6252663 251.5620 44.84145 213.1535  38.4084869   0.923804784
           Ynew
    1  169.3296
    2  200.9038
    3  189.4652
    4  177.5159
    5  210.4360
    6  203.8361
    7  236.5821
    8  171.9624
    9  187.0553
    10       NA
    11 164.8052
    12 192.8824
    13 182.2957
    14 146.3090
    15 192.9041
    16 186.6482
    17 228.3253
    18 184.8983
    19 213.9378
    20 203.5657
    <snip>
    49 208.1318
    50 251.5620
    

    Edit: I've just noticed an issue in my code. The Lund test produces a critical value that should be compared to the absolute value of the studentized residual (i.e. without sign).
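
    To get back to the original goal of running the calculation with and without the outliers, the flagged rows can be used to fit the model twice. This is a self-contained sketch under the same simulated setup; the seed and the names `dat`, `fit_all`, and `fit_clean` are illustrative assumptions:

    ```r
    # Critical value, same formula as lundcrit above
    lundcrit <- function(a, n, q) {
      Fq <- qf(1 - a / n, df1 = 1, df2 = n - q - 1)
      sqrt((n - q) * Fq / (n - q - 1 + Fq))
    }

    set.seed(42)  # arbitrary seed so the sketch is reproducible
    dat <- data.frame(X1 = rnorm(50, mean = 50, sd = 10),
                      X2 = rnorm(50, mean = 5,  sd = 1.5),
                      Y  = rnorm(50, mean = 200, sd = 25))
    # Taint row 10 as in the example
    dat$X1[10] <- 5
    dat$X2[10] <- 5
    dat$Y[10]  <- 530

    fit_all <- lm(Y ~ X1 + X2, data = dat)
    crit    <- lundcrit(0.1, nrow(dat), length(coef(fit_all)))
    flag    <- abs(rstandard(fit_all)) > crit   # compare without sign

    # Refit on the rows that were not flagged
    fit_clean <- lm(Y ~ X1 + X2, data = dat[!flag, ])

    print(which(flag))
    print(cbind(with_outliers = coef(fit_all), without_outliers = coef(fit_clean)))
    ```

    Reporting both sets of coefficients shows how much the conclusions depend on the flagged rows, instead of silently discarding them.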

  • 2020-12-04 19:32

    Try the outliers::scores function. I don't advise removing the so-called outliers, but knowing your extreme observations is good.

    library(outliers)
    set.seed(1234)
    x = rnorm(10)
    x
    #> [1] -1.2070657  0.2774292  1.0844412 -2.3456977  0.4291247  0.5060559 -0.5747400 -0.5466319
    #> [9] -0.5644520 -0.8900378
    outs <- scores(x, type="chisq", prob=0.9)  # beyond 90th %ile based on chi-sq
    outs
    #> [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
    x[outs]  # most extreme
    #> [1] -2.345698
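
    For readers who want to see what such a flag amounts to, here is a base-R sketch on the ten values printed above. It is an assumption about what the chi-squared score computes (squared distance from the mean, scaled by the variance, compared with a chi-squared quantile on 1 degree of freedom), not code taken from the package:

    ```r
    # The ten values printed above
    x <- c(-1.2070657,  0.2774292,  1.0844412, -2.3456977,  0.4291247,
            0.5060559, -0.5747400, -0.5466319, -0.5644520, -0.8900378)

    # Assumed equivalent of a chi-squared score with prob = 0.9
    chisq_score <- (x - mean(x))^2 / var(x)
    outs <- chisq_score > qchisq(0.9, df = 1)

    print(which(outs))         # positions of the extreme observations
    print(summary(x))          # all observations
    print(summary(x[!outs]))   # extreme observations set aside, not deleted
    ```

    Because `x` itself is left untouched, any later calculation can be run on both `x` and `x[!outs]` and the two results compared.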
    

    You'll find more help with outlier detection here
