Remove highly correlated variables

遥遥无期 2021-01-30 05:21

I have a huge data frame (5600 x 6592) and I want to remove any variables that are correlated with each other by more than 0.99. I do know how to do this the long way, step by step.

3 answers
  •  轻奢々 (original poster)
    2021-01-30 06:04

    I'm sure there are many ways to do this, and certainly some better than this one, but this should work. I set the upper triangle and the diagonal of the correlation matrix to zero, and then remove every column that still has a value over 0.99.

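    > tmp <- cor(data)            # pairwise correlations between the columns
    > tmp[upper.tri(tmp)] <- 0    # zero the upper triangle so each pair is checked once
    > diag(tmp) <- 0              # zero the diagonal (every variable correlates 1 with itself)
    # The two commands above can be replaced with:
    # tmp[!lower.tri(tmp)] <- 0
    >
    # Keep only the columns with no remaining correlation above 0.99
    > data.new <- data[, !apply(tmp, 2, function(x) any(x > 0.99))]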
    > head(data.new)
       V2 V3 V5
    1   2 10  4
    2   2 20 10
    3   5 10 31
    4   4 20  2
    5 366 10  2
    6  65 20  5
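
The same idea can be wrapped in a small function and tried on made-up data. What follows is only a sketch under stated assumptions: the toy data frame and the helper name `drop_correlated` are hypothetical, and it deliberately differs from the answer above in two ways. It compares the absolute correlation, so strong negative correlations (below -0.99) are also caught, and it passes `na.rm = TRUE` so that `NA` correlations (for example from a constant column) do not break the filter.

```r
# Toy data (hypothetical): V2 is almost identical to V1, V3 is almost
# exactly the negative of V1, and V4 is independent noise.
set.seed(42)
n  <- 200
V1 <- rnorm(n)
data <- data.frame(
  V1 = V1,
  V2 = V1 + rnorm(n, sd = 0.01),
  V3 = -V1 + rnorm(n, sd = 0.01),
  V4 = rnorm(n)
)

# Hypothetical helper: drop a column whenever its absolute correlation
# with any later column exceeds `cutoff`.
drop_correlated <- function(df, cutoff = 0.99) {
  cm <- abs(cor(df))               # absolute pairwise correlations
  cm[!lower.tri(cm)] <- 0          # zero the upper triangle and the diagonal
  too_high <- apply(cm, 2, function(x) any(x > cutoff, na.rm = TRUE))
  df[, !too_high, drop = FALSE]    # drop = FALSE keeps a data frame even for one column
}

data.new <- drop_correlated(data)
names(data.new)
```

Here V1 and V2 are dropped and V3 and V4 remain, because within a group of correlated variables the lower-triangle check keeps only the last column. One thing to be aware of: a column is dropped if it correlates with any later column, even one that is itself dropped, so in a chain where A matches B and B matches C, both A and B go even if A and C are not correlated above the cutoff.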
    
