I have a huge data frame (5600 x 6592) and I want to remove any variables that are correlated with each other by more than 0.99. I do know how to do this the long way, step by step, i.e.
I'm sure there are many ways to do this, and certainly some better than this, but this should work. I basically just set the upper triangle and the diagonal of the correlation matrix to zero and then remove any columns that have values over 0.99.
> tmp <- cor(data)
> tmp[upper.tri(tmp)] <- 0
> diag(tmp) <- 0
# The two commands above can be replaced with:
# tmp[!lower.tri(tmp)] <- 0
> data.new <- data[,!apply(tmp,2,function(x) any(x > 0.99))]
> head(data.new)
   V2 V3 V5
1   2 10  4
2   2 20 10
3   5 10 31
4   4 20  2
5 366 10  2
6  65 20  5
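
For reuse, the same steps can be wrapped in a small function. This is only a sketch: the name `drop_correlated`, the `cutoff` argument and the `toy` data frame are made up for illustration and are not part of the code above. It also takes `abs()` of the correlations, on the assumption that strong negative correlations (below -0.99) should be treated the same way; drop the `abs()` call to match the original behaviour exactly. It assumes all columns are numeric and contain no missing values.

```r
# Hypothetical helper: for every pair of columns whose absolute correlation
# exceeds `cutoff`, drop the earlier column of the pair.
drop_correlated <- function(df, cutoff = 0.99) {
  cm <- abs(cor(df))          # assumption: negative correlations count too
  cm[!lower.tri(cm)] <- 0     # zero the upper triangle and the diagonal
  keep <- !apply(cm, 2, function(x) any(x > cutoff))
  df[, keep, drop = FALSE]    # drop = FALSE keeps a data frame even with one column left
}

# Toy data (made up): c is a linear function of a, d is the negation of b
set.seed(1)
toy <- data.frame(a = rnorm(50), b = rnorm(50))
toy$c <- toy$a * 2 + 1
toy$d <- -toy$b

names(drop_correlated(toy))
```

Because only the lower triangle is kept, a column is dropped when it is highly correlated with any column to its right, so in the toy data `a` and `b` are removed and `c` and `d` stay. On a 5600 x 6592 data frame the full correlation matrix has about 43 million entries, so `cor()` itself will be the slow step.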