Correlation coefficients for a sparse matrix in Python?

[愿得一人] 2021-02-07 11:31

Does anyone know how to compute a correlation matrix from a very large sparse matrix in Python? Basically, I am looking for something like numpy.corrcoef that will

4 Answers
  •  庸人自扰
    2021-02-07 12:04

    You can compute the correlation coefficients fairly straightforwardly from the covariance matrix like this:

    import numpy as np
    from scipy import sparse
    
    def sparse_corrcoef(A, B=None):
        """Correlation coefficients between the rows of sparse A (and B, if given)."""
    
        if B is not None:
            A = sparse.vstack((A, B), format='csr')
    
        A = A.astype(np.float64)
        n = A.shape[1]
    
        # Compute the covariance matrix
        rowsum = A.sum(1)
        centering = rowsum.dot(rowsum.T.conjugate()) / n
        C = (A.dot(A.T.conjugate()) - centering) / (n - 1)
    
        # The correlation coefficients are given by
        # C_{i,j} / sqrt(C_{i,i} * C_{j,j})
        d = np.diag(C)
        coeffs = C / np.sqrt(np.outer(d, d))
    
        return coeffs
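
    For reference, here is the identity the covariance step relies on, written out for real-valued A with rows as variables and n columns as observations (for complex input the transpose becomes a conjugate transpose, as in the code). The symbols s and r are my own notation, not something from numpy:

```latex
\begin{aligned}
s_i &= \sum_{k=1}^{n} A_{ik}, \\
C_{ij} &= \frac{1}{n-1} \sum_{k=1}^{n}
          \left(A_{ik} - \frac{s_i}{n}\right)\left(A_{jk} - \frac{s_j}{n}\right)
        = \frac{1}{n-1} \left( \sum_{k=1}^{n} A_{ik} A_{jk} - \frac{s_i s_j}{n} \right), \\
C &= \frac{A A^{\mathsf{T}} - s s^{\mathsf{T}} / n}{n-1}, \qquad
r_{ij} = \frac{C_{ij}}{\sqrt{C_{ii}\, C_{jj}}}.
\end{aligned}
```

    The point is that the means never have to be subtracted from A itself (which would destroy its sparsity); only the sparse product and the row sums are needed.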
    

    Check that it works OK:

    # some smallish sparse random matrices
    a = sparse.rand(100, 100000, density=0.1, format='csr')
    b = sparse.rand(100, 100000, density=0.1, format='csr')
    
    coeffs1 = sparse_corrcoef(a, b)
    coeffs2 = np.corrcoef(a.todense(), b.todense())
    
    print(np.allclose(coeffs1, coeffs2))
    # True
    

    Be warned:

    The amount of memory required for computing the covariance matrix C will be heavily dependent on the sparsity structure of A (and B, if given). For example, if A is an (m, n) matrix containing just a single column of non-zero values then C will be an (m, m) matrix containing all non-zero values. If m is large then this could be very bad news in terms of memory consumption.
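
    If the full (m, m) result is too big to hold at once, one possible workaround is to compute it a block of rows at a time. The following is only a rough sketch under my own assumptions: the input is real-valued, no row has zero variance, and the helper name corrcoef_block and its start/stop parameters are made up for illustration. It returns the correlations of rows start..stop-1 against every row as a dense (stop - start, m) array:

```python
import numpy as np
from scipy import sparse


def corrcoef_block(A, start, stop):
    """Correlate rows A[start:stop] with all rows of A (hypothetical helper).

    Assumes real-valued data and that every row has non-zero variance.
    """
    A = sparse.csr_matrix(A, dtype=np.float64)
    n = A.shape[1]

    # Per-row sums and sums of squares, flattened to 1-D arrays
    rowsum = np.asarray(A.sum(axis=1)).ravel()
    sqsum = np.asarray(A.multiply(A).sum(axis=1)).ravel()

    # Per-row sample variance, i.e. the diagonal of the covariance matrix
    var = (sqsum - rowsum ** 2 / n) / (n - 1)

    # Covariance of the selected rows against all rows: only (k, m) entries
    cross = A[start:stop].dot(A.T).toarray()
    cov = (cross - np.outer(rowsum[start:stop], rowsum) / n) / (n - 1)

    return cov / np.sqrt(np.outer(var[start:stop], var))
```

    The total work is the same as before; the only gain is that the dense intermediate is (stop - start, m) instead of (m, m), so each block can be written to disk or reduced before moving on to the next one.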
