Remove outliers from correlation coefficient calculation

有刺的猬 2021-01-31 22:57

Assume we have two numeric vectors x and y. The Pearson correlation coefficient between x and y is given by

r = Σ (x_i − x̄)(y_i − ȳ) / √( Σ (x_i − x̄)² · Σ (y_i − ȳ)² )
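
For reference, here is that definition written out directly in R; it gives the same value as the built-in cor(x, y):

    # Pearson correlation written out from the definition above; equivalent to cor(x, y)
    pearson <- function(x, y) {
      dx <- x - mean(x)
      dy <- y - mean(y)
      sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
    }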

5 Answers
  • 2021-01-31 23:10

    This may already have been obvious to the OP, but just to make sure... you have to be careful, because trying to maximize the correlation can actually tend to include outliers. (@Gavin touched on this point in his answer/comments.) I would first remove outliers and then calculate the correlation. More generally, we want to calculate a correlation that is robust to outliers (and there are many such methods in R).

    Just to illustrate this dramatically, let's create two vectors x and y that are uncorrelated:

    set.seed(1)
    x <- rnorm(1000)
    y <- rnorm(1000)
    > cor(x,y)
    [1] 0.006401211
    

    Now let's add an outlier point (500,500):

    x <- c(x, 500)
    y <- c(y, 500)
    

    Now the correlation of any subset that includes the outlier point will be close to 100%, and the correlation of any sufficiently large subset that excludes the outlier will be close to zero. In particular,

    > cor(x,y)
    [1] 0.995741
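
    To check the subset claim directly (using the x and y built above, where the outlier sits at index 1001):

    cor(x[1:1000], y[1:1000])          # outlier excluded: back to roughly zero
    idx <- c(sample(1000, 100), 1001)  # a random subset that includes the outlier
    cor(x[idx], y[idx])                # close to 1 again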
    

    If you want to estimate a "true" correlation that is not sensitive to outliers, you might try the robust package:

    require(robust)
    > covRob(cbind(x,y), corr = TRUE)
    Call:
    covRob(data = cbind(x, y), corr = TRUE)
    
    Robust Estimate of Correlation: 
                x           y
    x  1.00000000 -0.02594260
    y -0.02594260  1.00000000
    

    You can play around with the parameters of covRob() to decide how to trim the data. UPDATE: there is also rlm() (robust linear regression) in the MASS package.
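
    A minimal sketch of that rlm() idea, reusing the contaminated x and y from above. method = "MM" is chosen because a plain M-estimate can still be pulled around by a high-leverage point like (500, 500); the fitted object's w component holds the robustness weight given to each observation, so weights near zero mark points the fit effectively ignored:

    library(MASS)
    rfit <- rlm(y ~ x, method = "MM")   # robust regression of y on x
    head(sort(rfit$w))                  # the smallest robustness weights
    order(rfit$w)[1:5]                  # indices of the most down-weighted observations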

  • 2021-01-31 23:13

    You might try repeatedly subsampling your data (here 90% of the observations, without replacement, as given in the question) and looking at the correlation coefficients you get, e.g.:

    x <- cars$dist
    y <- cars$speed
    percent <- 0.9         # given in the question above
    n <- 1000              # number of resamples
    boot.cor <- replicate(n, {
      tmp <- sample(length(x), round(length(x) * percent), replace = FALSE)  # draw 90% of the indices
      cor(x[tmp], y[tmp])
    })
    

    Then run max(boot.cor). Do not be disappointed if the correlation coefficients all come out very similar :)
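
    If you also want to see which observations were left out of the best-scoring resample, keep the sampled indices as well (a small variation on the code above, nothing beyond base R):

    boot <- replicate(n, {
      tmp <- sample(length(x), round(length(x) * percent), replace = FALSE)
      list(idx = tmp, cor = cor(x[tmp], y[tmp]))
    }, simplify = FALSE)
    best <- boot[[which.max(sapply(boot, `[[`, "cor"))]]
    best$cor                          # the highest correlation found
    setdiff(seq_along(x), best$idx)   # the observations excluded from that resample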

  • 2021-01-31 23:26

    If you really want to do this (remove the observations with the largest (absolute) residuals), we can fit a linear model to get the least-squares solution and the associated residuals, and then select the middle n% of the data. Here is an example:

    Firstly, generate some dummy data:

    require(MASS) ## for mvrnorm()
    set.seed(1)
    dat <- mvrnorm(1000, mu = c(4,5), Sigma = matrix(c(1,0.8,1,0.8), ncol = 2))
    dat <- data.frame(dat)
    names(dat) <- c("X","Y")
    plot(dat)
    

    Next, we fit the linear model and extract the residuals:

    res <- resid(mod <- lm(Y ~ X, data = dat))
    

    The quantile() function can give us the required quantiles of the residuals. You suggested retaining 90% of the data, so we want the 0.05 and 0.95 quantiles:

    res.qt <- quantile(res, probs = c(0.05,0.95))
    

    Select those observations with residuals in the middle 90% of the data:

    want <- which(res >= res.qt[1] & res <= res.qt[2])
    

    We can then visualise this, with the red points being those we will retain:

    plot(dat, type = "n")
    points(dat[-want,], col = "black", pch = 21, bg = "black", cex = 0.8)
    points(dat[want,], col = "red", pch = 21, bg = "red", cex = 0.8)
    abline(mod, col = "blue", lwd = 2)
    

    (Plot of the dummy data: the retained points, those with residuals in the middle 90%, in red; the fitted line in blue.)

    The correlations for the full data and the selected subset are:

    > cor(dat)
              X         Y
    X 1.0000000 0.8935235
    Y 0.8935235 1.0000000
    > cor(dat[want,])
              X         Y
    X 1.0000000 0.9272109
    Y 0.9272109 1.0000000
    > cor(dat[-want,])
             X        Y
    X 1.000000 0.739972
    Y 0.739972 1.000000
    

    Be aware that here we might be throwing out perfectly good data, because we simply chose the 5% with the largest positive residuals and the 5% with the largest negative ones. An alternative is to select the 90% with the smallest absolute residuals:

    ares <- abs(res)
    absres.qt <- quantile(ares, prob = c(.9))
    abswant <- which(ares <= absres.qt)
    ## plot - virtually the same, but not quite
    plot(dat, type = "n")
    points(dat[-abswant,], col = "black", pch = 21, bg = "black", cex = 0.8)
    points(dat[abswant,], col = "red", pch = 21, bg = "red", cex = 0.8)
    abline(mod, col = "blue", lwd = 2)
    

    With this slightly different subset, the correlation is slightly lower:

    > cor(dat[abswant,])
              X         Y
    X 1.0000000 0.9272032
    Y 0.9272032 1.0000000
    

    Another point is that even then we are throwing out good data. You might want to look at Cook's distance as a measure of the strength of each outlier, and discard only those observations whose Cook's distance exceeds a certain threshold. Wikipedia has info on Cook's distance and proposed thresholds. The cooks.distance() function can be used to retrieve the values from mod:

    > head(cooks.distance(mod))
               1            2            3            4            5            6 
    7.738789e-04 6.056810e-04 6.375505e-04 4.338566e-04 1.163721e-05 1.740565e-03
    

    If you compute the thresholds suggested on Wikipedia (for example D_i > 1, or D_i > 4/n) and remove only the observations that exceed them, then for these data:

    > any(cooks.distance(mod) > 1)
    [1] FALSE
    

    none of the Cook's distances come anywhere near the D > 1 cutoff (not surprising given the way I generated the data). The stricter 4/n rule of thumb, any(cooks.distance(mod) > 4 / nrow(dat)), may well flag a few points, which you could inspect individually before deciding to discard anything.
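
    To act on that rather than just test it, here is a small sketch that drops any observations above the 4/n cutoff and recomputes the correlation (the cutoff is only the rule of thumb mentioned above, not a hard rule):

    keep <- cooks.distance(mod) <= 4 / nrow(dat)  # TRUE for observations below the cutoff
    cor(dat[keep, ])                              # correlation with the flagged points removed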

    Having said all of this, why do you want to do this? If you are just trying to get rid of data to improve a correlation or generate a significant relationship, that sounds a bit fishy, and a bit like data dredging, to me.

  • 2021-01-31 23:27

    Using method = "spearman" in cor will be robust to contamination and is easy to implement since it only involves replacing cor(x, y) with cor(x, y, method = "spearman").

    Repeating Prasad's analysis but using Spearman correlations instead we find that the Spearman correlation is indeed robust to the contamination here, recovering the underlying zero correlation:

    set.seed(1)
    
    # x and y are uncorrelated
    x <- rnorm(1000)
    y <- rnorm(1000)
    cor(x,y)
    ## [1] 0.006401211
    
    # add contamination -- now cor says they are highly correlated
    x <- c(x, 500)
    y <- c(y, 500)
    cor(x, y)
    ## [1] 0.995741
    
    # but with method = "spearman" contamination is removed & they are shown to be uncorrelated
    cor(x, y, method = "spearman")
    ## [1] -0.007270813
    
  • 2021-01-31 23:33

    Here's another possibility, with the outliers captured. Using a scheme similar to Prasad's:

    library(mvoutlier)    
    set.seed(1)    
    x <- rnorm(1000)    
    y <- rnorm(1000)    
    xy <- cbind(x, y)    
    outliers <- aq.plot(xy, alpha = 0.975)  # the documented default is alpha = 0.025; I think the function wants 0.975
    cor.plot(x, y)    
    color.plot(xy)   
    dd.plot(xy)   
    uni.plot(xy)    
    

    In the other answers, 500 was stuck on the end of x and y as an outlier. That may or may not cause a memory problem on your machine, so I dropped it down to 4 to avoid that.

    x1 <- c(x, 4)     
    y1 <- c(y, 4)    
    xy1 <- cbind(x1, y1)    
    outliers1 <- aq.plot(xy1, alpha = 0.975)  # the documented default is alpha = 0.025; I think the function wants 0.975
    cor.plot(x1, y1)    
    color.plot(xy1)    
    dd.plot(xy1)    
    uni.plot(xy1)    
    

    (The aq.plot, cor.plot, color.plot, dd.plot and uni.plot images produced from the x1, y1, xy1 data are not reproduced here.)
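
    To tie this back to the correlation question, the object returned by aq.plot() can be used to drop the flagged points and recompute the correlation (this assumes the returned list has a logical outliers component marking the flagged rows; check str(outliers1) if unsure):

    keep <- !outliers1$outliers     # rows aq.plot() did not flag
    cor(x1[keep], y1[keep])         # correlation with the flagged points removed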
