Does anyone know how to compute a correlation matrix from a very large sparse matrix in Python? Basically, I am looking for something like numpy.corrcoef
that will
Unfortunately, Alt's answer didn't work out for me. The values given to the np.sqrt
function were mostly negative, so the resulting covariance values were nan.
I wasn't able to use ali_m's answer either, because my matrix was so large that I couldn't fit the centering = rowsum.dot(rowsum.T.conjugate()) / n
matrix in memory (my matrix's dimensions are 3.5*10^6 x 33).
Instead, I used scikit-learn's StandardScaler to compute the standardized sparse matrix and then used a matrix multiplication to obtain the correlation matrix:
from sklearn.preprocessing import StandardScaler

def compute_sparse_correlation_matrix(A):
    scaler = StandardScaler(with_mean=False)
    scaled_A = scaler.fit_transform(A)  # Assuming A is a CSR or CSC matrix
    corr_matrix = (1 / scaled_A.shape[0]) * (scaled_A.T @ scaled_A)
    return corr_matrix
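I believe this approach is faster and more robust than the other approaches mentioned. Because with_mean=False skips centering, the scaled matrix also keeps the sparsity pattern of the input matrix. Note, however, that for the same reason the result equals the Pearson correlation (what numpy.corrcoef returns) only when the column means are zero; otherwise each entry is the uncentered second moment E[x_i x_j] divided by the product of the two standard deviations.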
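StandardScaler(with_mean=False) scales the columns but does not center them, so here is a minimal sketch of one way to put the centering back without ever densifying A. It relies on the identity cov = E[x x^T] - mu mu^T, so only the small (columns x columns) result becomes dense. The function name sparse_pearson_corr is hypothetical (not part of any library), and the sketch assumes that the variables are the columns of A and that their number is small enough for a dense result (33 in my case). The subtraction can lose precision when the means are large relative to the standard deviations, so treat it as an illustration rather than a drop-in replacement.

```python
import numpy as np
import scipy.sparse as sp

def sparse_pearson_corr(A):
    """Pearson correlation between the columns of a sparse matrix A (n x d).

    Hypothetical helper: returns a dense d x d array, so d must be small.
    """
    A = sp.csr_matrix(A, dtype=np.float64)
    n = A.shape[0]
    mean = np.asarray(A.mean(axis=0)).ravel()        # column means, shape (d,)
    # E[x_i x_j] from a sparse product; only the d x d result is densified
    second_moment = (A.T @ A).toarray() / n
    cov = second_moment - np.outer(mean, mean)       # population covariance
    # Round-off can make a variance slightly negative, so clip before sqrt
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    with np.errstate(divide="ignore", invalid="ignore"):
        corr = cov / np.outer(std, std)              # nan for constant columns
    return corr

# Usage: compare against the dense reference on a small random matrix
A = sp.random(1000, 5, density=0.2, format="csr", random_state=0)
corr = sparse_pearson_corr(A)
reference = np.corrcoef(A.toarray(), rowvar=False)
```

On a small matrix like this one, the result can be checked against np.corrcoef(A.toarray(), rowvar=False); for the full-size matrix only the sparse path is feasible.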