A matrix version of cor.test()

有刺的猬 2020-11-27 03:23

cor.test() takes vectors x and y as arguments, but I have an entire matrix of data that I want to test, pairwise. cor() t

5 answers
  • 2020-11-27 03:55

    corr.test in the psych package is designed to do this:

    library("psych")
    data(sat.act)
    corr.test(sat.act)
    

    As noted in the comments, to replicate the p-values from the base cor.test() function over the entire matrix, you need to turn off the adjustment of p-values for multiple comparisons (the default is Holm's method):

     corr.test(sat.act, adjust = "none")
    

    [But be careful when interpreting those results!]
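
    To see what the adjustment changes, here is a rough base-R sketch (not psych's internal code). It assumes the adjustment is applied across the unique variable pairs; the four mtcars columns are an arbitrary choice for illustration.

    ```r
    # Pairwise Pearson p-values for a few mtcars columns, base R only
    x <- mtcars[, c("mpg", "hp", "wt", "qsec")]
    pairs <- combn(colnames(x), 2)   # one column per unordered pair of variables
    raw_p <- apply(pairs, 2, function(v) cor.test(x[[v[1]]], x[[v[2]]])$p.value)
    names(raw_p) <- paste(pairs[1, ], pairs[2, ], sep = "~")

    # Holm adjustment over the unique tests; adjusted values are never smaller
    holm_p <- p.adjust(raw_p, method = "holm")
    round(cbind(raw = raw_p, holm = holm_p), 5)
    ```

    The raw column is what cor.test() reports for each pair on its own; the holm column is what you get once the number of tests is taken into account.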

  • 2020-11-27 04:00

    If you're strictly after the p-values from cor.test in matrix format, here's a solution shamelessly stolen from Vincent (LINK):

    cor.test.p <- function(x){
        FUN <- function(x, y) cor.test(x, y)[["p.value"]]
        z <- outer(
          colnames(x), 
          colnames(x), 
          Vectorize(function(i,j) FUN(x[,i], x[,j]))
        )
        dimnames(z) <- list(colnames(x), colnames(x))
        z
    }
    
    cor.test.p(mtcars)
    

    Note: Tommy also provides a faster solution, though it is less easy to implement. Oh, and no for loops :)

    Edit: I have a function v_outer in my qdapTools package that makes this task pretty easy:

    library(qdapTools)
    (out <- v_outer(mtcars, function(x, y) cor.test(x, y)[["p.value"]]))
    print(out, digits=4)  # for more digits
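
    Both versions above call cor.test() for every ordered pair, so each test runs twice (plus once per column against itself). A minimal sketch that visits each unordered pair once, assuming Pearson tests and all-numeric columns; cor_test_p_upper is a made-up name, and it does use a loop:

    ```r
    # Hypothetical helper: one cor.test() call per unordered column pair
    cor_test_p_upper <- function(x) {
      x <- as.data.frame(x)
      vars <- colnames(x)
      p <- matrix(NA_real_, length(vars), length(vars), dimnames = list(vars, vars))
      idx <- combn(length(vars), 2)          # each column holds one pair (i < j)
      for (k in seq_len(ncol(idx))) {
        i <- idx[1, k]
        j <- idx[2, k]
        # fill both triangles from a single test
        p[i, j] <- p[j, i] <- cor.test(x[[i]], x[[j]])$p.value
      }
      p                                      # diagonal stays NA: no self-tests
    }

    cor_test_p_upper(mtcars[, 1:4])
    ```

    The off-diagonal entries should match cor.test.p(); only the diagonal differs, because a column is never tested against itself.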
    
  • 2020-11-27 04:00

    "The accepted solution (corr.test function in the psych package) works, but is extremely slow for large matrices."

    If you use ci=FALSE, it runs much faster. By default, confidence intervals are computed, and that is what slows it down. So, if you only need the rs, ts and ps, set ci=FALSE.

  • 2020-11-27 04:11

    Probably the easiest way is to use rcorr() from the Hmisc package. It only accepts a matrix, so use rcorr(as.matrix(x)) if your data are in a data.frame. It returns a list with: 1) a matrix of pairwise r, 2) a matrix of pairwise n, 3) a matrix of p-values for the r's. Missing data are handled automatically by pairwise deletion.

    Ideally, a function of this kind should take data.frames too and also output confidence intervals in line with the 'New Statistics'.
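
    As a rough idea of what such a function could look like in base R: the sketch below accepts a data.frame and adds Fisher z confidence limits, the same interval cor.test() reports for Pearson correlations. The name cor_ci_matrix and its return layout are invented for illustration, and it is not part of Hmisc.

    ```r
    # Hypothetical helper: pairwise Pearson r, n, p and Fisher-z confidence limits
    cor_ci_matrix <- function(df, conf.level = 0.95) {
      df <- as.data.frame(df)
      x <- as.matrix(df[vapply(df, is.numeric, logical(1))])  # numeric columns only
      ok <- !is.na(x)
      n <- t(ok) %*% ok                            # pairwise complete counts
      r <- cor(x, use = "pairwise.complete.obs")
      tstat <- r * sqrt((n - 2) / (1 - r^2))       # t statistic with n - 2 df
      p <- 2 * pt(-abs(tstat), df = n - 2)         # two-sided p-value
      # Fisher z interval; needs more than 3 complete pairs to be defined
      half <- qnorm((1 + conf.level) / 2) / sqrt(n - 3)
      z <- atanh(r)
      list(r = r, n = n, p = p, lower = tanh(z - half), upper = tanh(z + half))
    }

    res <- cor_ci_matrix(mtcars)
    c(lower = res$lower["mpg", "wt"], upper = res$upper["mpg", "wt"])
    ```

    For any single pair, the limits can be checked against cor.test(mtcars$mpg, mtcars$wt)$conf.int.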

  • 2020-11-27 04:11

    The accepted solution (corr.test function in the psych package) works, but is extremely slow for large matrices. I was working with a gene expression matrix (~20,000 by ~1,000) correlated to a drug sensitivity matrix (~1,000 by ~500) and I had to stop it because it was taking forever.

    I took some code from the psych package and used the cor() function directly instead and got much better results:

    # find (pairwise complete) correlation matrix between two matrices x and y
    # compare to corr.test(x, y, adjust = "none")
    n <- t(!is.na(x)) %*% (!is.na(y)) # same as count.pairwise(x,y) from psych package
    r <- cor(x, y, use = "pairwise.complete.obs") # MUCH MUCH faster than corr.test()
    cor2pvalue <- function(r, n) {
      tstat <- (r * sqrt(n - 2)) / sqrt(1 - r^2)   # t statistic for each correlation
      p <- 2 * pt(-abs(tstat), n - 2)              # two-sided p-value; lower tail avoids rounding to 0
      se <- sqrt((1 - r^2) / (n - 2))              # standard error of r
      list(r = r, n = n, t = tstat, p = p, se = se)
    }
    # get a list with matrices of correlations, p-values, standard errors, etc.
    result <- cor2pvalue(r, n)
    

    Even with two 100 x 200 matrices, the difference was staggering: well under a second of elapsed time versus about 45 seconds.

    > system.time(test_func(x,y))
       user  system elapsed 
      0.308   2.452   0.130 
    > system.time(corr.test(x,y, adjust = "none"))
       user  system elapsed 
     45.004   3.276  45.814 
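
    A quick sanity check that the shortcut agrees with cor.test() itself, on small simulated matrices with a few missing values. The data, dimensions and seed are made up for illustration; this is a sketch, not the benchmark above.

    ```r
    set.seed(1)                                   # simulated data, illustration only
    x <- matrix(rnorm(50 * 3), 50, 3, dimnames = list(NULL, paste0("x", 1:3)))
    y <- matrix(rnorm(50 * 2), 50, 2, dimnames = list(NULL, paste0("y", 1:2)))
    x[sample(length(x), 5)] <- NA                 # knock out a few values

    n <- t(!is.na(x)) %*% (!is.na(y))             # pairwise complete counts
    r <- cor(x, y, use = "pairwise.complete.obs")
    tstat <- r * sqrt(n - 2) / sqrt(1 - r^2)
    p <- 2 * pt(-abs(tstat), n - 2)               # two-sided p-value

    # Reference: one cor.test() call per (x column, y column) pair
    p_ref <- outer(colnames(x), colnames(y),
                   Vectorize(function(i, j) cor.test(x[, i], y[, j])$p.value))
    all.equal(unname(p), p_ref)
    ```

    cor.test() drops incomplete pairs before testing, which is why the pairwise counts in n are the right sample sizes to use here.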
    