Remove outliers from correlation coefficient calculation

有刺的猬 2021-01-31 22:57

Assume we have two numeric vectors x and y. The Pearson correlation coefficient between x and y is given by

r = Σ (x_i − x̄)(y_i − ȳ) / √( Σ (x_i − x̄)² · Σ (y_i − ȳ)² )
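
For reference, here is that definition written out directly in R; it gives the same value as the built-in cor(x, y):

    # Pearson correlation written out from the definition above; equivalent to cor(x, y)
    pearson <- function(x, y) {
      dx <- x - mean(x)
      dy <- y - mean(y)
      sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
    }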

5 Answers
  • 2021-01-31 23:10

    This may already have been obvious to the OP, but just to make sure... you have to be careful, because trying to maximize the correlation can actually tend to include outliers. (@Gavin touched on this point in his answer/comments.) I would first remove outliers and then calculate the correlation. More generally, we want to calculate a correlation that is robust to outliers (and there are many such methods in R).

    Just to illustrate this dramatically, let's create two vectors x and y that are uncorrelated:

    set.seed(1)
    x <- rnorm(1000)
    y <- rnorm(1000)
    > cor(x,y)
    [1] 0.006401211
    

    Now let's add an outlier point (500,500):

    x <- c(x, 500)
    y <- c(y, 500)
    

    Now the correlation of any subset that includes the outlier point will be close to 100%, and the correlation of any sufficiently large subset that excludes the outlier will be close to zero. In particular,

    > cor(x,y)
    [1] 0.995741
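
    To check the subset claim directly (using the x and y built above, where the outlier sits at index 1001):

    cor(x[1:1000], y[1:1000])          # outlier excluded: back to roughly zero
    idx <- c(sample(1000, 100), 1001)  # a random subset that includes the outlier
    cor(x[idx], y[idx])                # close to 1 again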
    

    If you want to estimate a "true" correlation that is not sensitive to outliers, you might try the robust package:

    require(robust)
    > covRob(cbind(x,y), corr = TRUE)
    Call:
    covRob(data = cbind(x, y), corr = TRUE)
    
    Robust Estimate of Correlation: 
                x           y
    x  1.00000000 -0.02594260
    y -0.02594260  1.00000000
    

    You can play around with the parameters of covRob() to decide how to trim the data. UPDATE: there is also rlm() (robust linear regression) in the MASS package.
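
    A minimal sketch of that rlm() idea, reusing the contaminated x and y from above. method = "MM" is chosen because a plain M-estimate can still be pulled around by a high-leverage point like (500, 500); the fitted object's w component holds the robustness weight given to each observation, so weights near zero mark points the fit effectively ignored:

    library(MASS)
    rfit <- rlm(y ~ x, method = "MM")   # robust regression of y on x
    head(sort(rfit$w))                  # the smallest robustness weights
    order(rfit$w)[1:5]                  # indices of the most down-weighted observations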

  • 2021-01-31 23:13

    You might try repeatedly subsampling your data (here 90% of the observations, without replacement, as given in the question) and looking at the correlation coefficients you get, e.g.:

    x <- cars$dist
    y <- cars$speed
    percent <- 0.9         # given in the question above
    n <- 1000              # number of resamples
    boot.cor <- replicate(n, {
      tmp <- sample(length(x), round(length(x) * percent), replace = FALSE)  # draw 90% of the indices
      cor(x[tmp], y[tmp])
    })
    

    Then run max(boot.cor). Do not be disappointed if the correlation coefficients all come out very similar :)
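
    If you also want to see which observations were left out of the best-scoring resample, keep the sampled indices as well (a small variation on the code above, nothing beyond base R):

    boot <- replicate(n, {
      tmp <- sample(length(x), round(length(x) * percent), replace = FALSE)
      list(idx = tmp, cor = cor(x[tmp], y[tmp]))
    }, simplify = FALSE)
    best <- boot[[which.max(sapply(boot, `[[`, "cor"))]]
    best$cor                          # the highest correlation found
    setdiff(seq_along(x), best$idx)   # the observations excluded from that resample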

  • 2021-01-31 23:26

    If you really want to do this (remove the observations with the largest (absolute) residuals), we can fit a linear model to get the least-squares solution and the associated residuals, and then select the middle n% of the data. Here is an example:

    Firstly, generate some dummy data:

    require(MASS) ## for mvrnorm()
    set.seed(1)
    dat <- mvrnorm(1000, mu = c(4,5), Sigma = matrix(c(1,0.8,1,0.8), ncol = 2))
    dat <- data.frame(dat)
    names(dat) <- c("X","Y")
    plot(dat)
    

    Next, we fit the linear model and extract the residuals:

    res <- resid(mod <- lm(Y ~ X, data = dat))
    

    The quantile() function can give us the required quantiles of the residuals. You suggested retaining 90% of the data, so we want the 0.05 and 0.95 quantiles:

    res.qt <- quantile(res, probs = c(0.05,0.95))
    

    Select those observations with residuals in the middle 90% of the data:

    want <- which(res >= res.qt[1] & res <= res.qt[2])
    

    We can then visualise this, with the red points being those we will retain:

    plot(dat, type = "n")
    points(dat[-want,], col = "black", pch = 21, bg = "black", cex = 0.8)
    points(dat[want,], col = "red", pch = 21, bg = "red", cex = 0.8)
    abline(mod, col = "blue", lwd = 2)
    

    (Plot of the dummy data: the retained points, those with residuals in the middle 90%, in red; the fitted line in blue.)

    The correlations for the full data and the selected subset are:

    > cor(dat)
              X         Y
    X 1.0000000 0.8935235
    Y 0.8935235 1.0000000
    > cor(dat[want,])
              X         Y
    X 1.0000000 0.9272109
    Y 0.9272109 1.0000000
    > cor(dat[-want,])
             X        Y
    X 1.000000 0.739972
    Y 0.739972 1.000000
    

    Be aware that here we might be throwing out perfectly good data, because we simply chose the 5% with the largest positive residuals and the 5% with the largest negative ones. An alternative is to select the 90% with the smallest absolute residuals:

    ares <- abs(res)
    absres.qt <- quantile(ares, prob = c(.9))
    abswant <- which(ares <= absres.qt)
    ## plot - virtually the same, but not quite
    plot(dat, type = "n")
    points(dat[-abswant,], col = "black", pch = 21, bg = "black", cex = 0.8)
    points(dat[abswant,], col = "red", pch = 21, bg = "red", cex = 0.8)
    abline(mod, col = "blue", lwd = 2)
    

    With this slightly different subset, the correlation is slightly lower:

    > cor(dat[abswant,])
              X         Y
    X 1.0000000 0.9272032
    Y 0.9272032 1.0000000
    

    Another point is that even then we are throwing out good data. You might want to look at Cook's distance as a measure of the strength of each outlier, and discard only those observations whose Cook's distance exceeds a certain threshold. Wikipedia has info on Cook's distance and proposed thresholds. The cooks.distance() function can be used to retrieve the values from mod:

    > head(cooks.distance(mod))
               1            2            3            4            5            6 
    7.738789e-04 6.056810e-04 6.375505e-04 4.338566e-04 1.163721e-05 1.740565e-03
    

    If you compute the thresholds suggested on Wikipedia (for example D_i > 1, or D_i > 4/n) and remove only the observations that exceed them, then for these data:

    > any(cooks.distance(mod) > 1)
    [1] FALSE
    

    none of the Cook's distances come anywhere near the D > 1 cutoff (not surprising given the way I generated the data). The stricter 4/n rule of thumb, any(cooks.distance(mod) > 4 / nrow(dat)), may well flag a few points, which you could inspect individually before deciding to discard anything.
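
    To act on that rather than just test it, here is a small sketch that drops any observations above the 4/n cutoff and recomputes the correlation (the cutoff is only the rule of thumb mentioned above, not a hard rule):

    keep <- cooks.distance(mod) <= 4 / nrow(dat)  # TRUE for observations below the cutoff
    cor(dat[keep, ])                              # correlation with the flagged points removed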

    Having said all of this, why do you want to do this? If you are just trying to get rid of data to improve a correlation or generate a significant relationship, that sounds a bit fishy, and a bit like data dredging, to me.

  • 2021-01-31 23:27

    Using method = "spearman" in cor will be robust to contamination and is easy to implement since it only involves replacing cor(x, y) with cor(x, y, method = "spearman").

    Repeating Prasad's analysis but using Spearman correlations instead we find that the Spearman correlation is indeed robust to the contamination here, recovering the underlying zero correlation:

    set.seed(1)
    
    # x and y are uncorrelated
    x <- rnorm(1000)
    y <- rnorm(1000)
    cor(x,y)
    ## [1] 0.006401211
    
    # add contamination -- now cor says they are highly correlated
    x <- c(x, 500)
    y <- c(y, 500)
    cor(x, y)
    ## [1] 0.995741
    
    # but with method = "spearman" contamination is removed & they are shown to be uncorrelated
    cor(x, y, method = "spearman")
    ## [1] -0.007270813
    
  • 2021-01-31 23:33

    Here's another possibility, with the outliers captured. Using a scheme similar to Prasad's:

    library(mvoutlier)    
    set.seed(1)    
    x <- rnorm(1000)    
    y <- rnorm(1000)    
    xy <- cbind(x, y)    
    outliers <- aq.plot(xy, alpha = 0.975)  # the documented default is alpha = 0.025; I think the function wants 0.975
    cor.plot(x, y)    
    color.plot(xy)   
    dd.plot(xy)   
    uni.plot(xy)    
    

    In the other answers, 500 was stuck on the end of x and y as an outlier. That may or may not cause a memory problem on your machine, so I dropped it down to 4 to avoid that.

    x1 <- c(x, 4)     
    y1 <- c(y, 4)    
    xy1 <- cbind(x1, y1)    
    outliers1 <- aq.plot(xy1, alpha = 0.975)  # the documented default is alpha = 0.025; I think the function wants 0.975
    cor.plot(x1, y1)    
    color.plot(xy1)    
    dd.plot(xy1)    
    uni.plot(xy1)    
    

    (The aq.plot, cor.plot, color.plot, dd.plot and uni.plot images produced from the x1, y1, xy1 data are not reproduced here.)
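
    To tie this back to the correlation question, the object returned by aq.plot() can be used to drop the flagged points and recompute the correlation (this assumes the returned list has a logical outliers component marking the flagged rows; check str(outliers1) if unsure):

    keep <- !outliers1$outliers     # rows aq.plot() did not flag
    cor(x1[keep], y1[keep])         # correlation with the flagged points removed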
