Correlation coefficients for sparse matrix in python?

前端 未结 4 1909
[愿得一人]
[愿得一人] 2021-02-07 11:31

Does anyone know how to compute a correlation matrix from a very large sparse matrix in python? Basically, I am looking for something like numpy.corrcoef that will

相关标签:
4条回答
  • 2021-02-07 11:57

    You do not need to introduce a large dense matrix. Just keep it sparse using Numpy:

    import numpy as np    
    def sparse_corr(A):
        N = A.shape[0]
        C=((A.T*A -(sum(A).T*sum(A)/N))/(N-1)).todense()
        V=np.sqrt(np.mat(np.diag(C)).T*np.mat(np.diag(C)))
        COR = np.divide(C,V+1e-119)
        return COR
    

    Testing the performance:

    A = sparse.rand(1000000, 100, density=0.1, format='csr')
    sparse_corr(A)
    
    0 讨论(0)
  • 2021-02-07 12:04

    You can compute the correlation coefficients fairly straightforwardly from the covariance matrix like this:

    import numpy as np
    from scipy import sparse
    
    def sparse_corrcoef(A, B=None):
    
        if B is not None:
            A = sparse.vstack((A, B), format='csr')
    
        A = A.astype(np.float64)
        n = A.shape[1]
    
        # Compute the covariance matrix
        rowsum = A.sum(1)
        centering = rowsum.dot(rowsum.T.conjugate()) / n
        C = (A.dot(A.T.conjugate()) - centering) / (n - 1)
    
        # The correlation coefficients are given by
        # C_{i,j} / sqrt(C_{i} * C_{j})
        d = np.diag(C)
        coeffs = C / np.sqrt(np.outer(d, d))
    
        return coeffs
    

    Check that it works OK:

    # some smallish sparse random matrices
    a = sparse.rand(100, 100000, density=0.1, format='csr')
    b = sparse.rand(100, 100000, density=0.1, format='csr')
    
    coeffs1 = sparse_corrcoef(a, b)
    coeffs2 = np.corrcoef(a.todense(), b.todense())
    
    print(np.allclose(coeffs1, coeffs2))
    # True
    

    Be warned:

    The amount of memory required for computing the covariance matrix C will be heavily dependent on the sparsity structure of A (and B, if given). For example, if A is an (m, n) matrix containing just a single column of non-zero values then C will be an (n, n) matrix containing all non-zero values. If n is large then this could be very bad news in terms of memory consumption.

    0 讨论(0)
  • 2021-02-07 12:15

    I present an answer for a scipy sparse matrix which runs in parallel. Rather than returning a giant correlation matrix, this returns a feature mask of fields to keep after checking all fields for both positive and negative Pearson correlations.

    I also try to minimize calculations using the following strategy:

    • Process each column
    • Start at the current column + 1 and calculate correlations moving to the right.
    • For any abs(correlation) >= threshold, mark the current column for removal and calculate no further correlations.
    • Perform these steps for each column in the dataset except the last.

    This might be sped up further by keeping a global list of columns marked for removal and skipping further correlation calculations for such columns, since columns will execute out of order. However, I do not know enough about race conditions in python to implement this tonight.

    Returning a column mask will obviously allow the code to handle much larger datasets than returning the entire correlation matrix.

    Check each column using this function:

    def get_corr_row(idx_num, sp_mat, thresh):
        # slice the column at idx_num
        cols = sp_mat.shape[1]
        x = sp_mat[:,idx_num].toarray().ravel()
        start = idx_num + 1
        
        # Now slice each column to the right of idx_num   
        for i in range(start, cols):
            y = sp_mat[:,i].toarray().ravel()
            # Check the pearson correlation
            corr, pVal = pearsonr(x,y)
            # Pearson ranges from -1 to 1.
            # We check both positive and negative correlations >= thresh using abs(corr)
            if abs(corr) >= thresh:
                # stop checking after finding the 1st correlation > thresh   
                return False
                # Mark column at idx_num for removal in the mask  
        return True  
        
    

    Run the column level correlation checks in parallel:

    from joblib import Parallel, delayed  
    import multiprocessing
    
    
    def Get_Corr_Mask(sp_mat, thresh, n_jobs=-1):
        
        # we must make sure the matrix is in csc format 
        # before we start doing all these column slices!  
        sp_mat = sp_mat.tocsc()
        cols = sp_mat.shape[1]
        
        if n_jobs == -1:
            # Process the work on all available CPU cores
            num_cores = multiprocessing.cpu_count()
        else:
            # Process the work on the specified number of CPU cores
            num_cores = n_jobs
    
        # Return a mask of all columns to keep by calling get_corr_row() 
        # once for each column in the matrix     
        return Parallel(n_jobs=num_cores, verbose=5)(delayed(get_corr_row)(i, sp_mat, thresh)for i in range(cols))
    

    General Usage:

    #Get the mask using your sparse matrix and threshold.
    corr_mask = Get_Corr_Mask(X_t_fpr, 0.95) 
    
    # Remove features that are >= 95% correlated
    X_t_fpr_corr = X_t_fpr[:,corr_mask]
    
    0 讨论(0)
  • 2021-02-07 12:19

    Unfortunately, Alt's answer didn't work out for me. The values given to the np.sqrt function where mostly negative, so the resulting covariance values were nan.

    I wasn't able to use ali_m's answer as well, because my matrix was too large that I couldn't fit the centering = rowsum.dot(rowsum.T.conjugate()) / n matrix in my memory (My matrix's dimensions are: 3.5*10^6 x 33)

    Instead, I used scikit-learn's StandardScaler to compute the standard sparse matrix and then used a multiplication to obtain the correlation matrix.

    from sklearn.preprocessing import StandardScaler
    
    def compute_sparse_correlation_matrix(A):
        scaler = StandardScaler(with_mean=False)
        scaled_A = scaler.fit_transform(A)  # Assuming A is a CSR or CSC matrix
        corr_matrix = (1/scaled_A.shape[0]) * (scaled_A.T @ scaled_A)
        return corr_matrix
    

    I believe that this approach is faster and more robust than the other mentioned approaches. Moreover, it also preserves the sparsity pattern of the input matrix.

    0 讨论(0)
提交回复
热议问题