How can I calculate the Pearson cross-correlation matrix of a large (>10TB) data set, possibly in a distributed manner? Any suggestion for an efficient distributed algorithm will be appreciated.
Each local data set can be converted into standard deviations and covariances. These per-partition standard deviations and covariances, together with the sums, can then be combined into the overall correlation.
This is a working example: https://github.com/jeesim2/distributed-correlation
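
To illustrate the idea of combining per-partition statistics, here is a minimal sketch in plain Python. It is not taken from the linked repository. The function names (`local_stats`, `merge_stats`, `correlation_matrix`) are hypothetical, and it assumes each partition fits in memory as a list of rows, has at least one row, and that no column is constant. Each partition is summarized by its row count, column means, and the matrix of centered cross-products (the covariance times n - 1). Two summaries are merged with the standard pairwise update for means and co-moments, so the merge can run as a tree-style reduce across workers.

```python
import math
from functools import reduce

def local_stats(rows):
    """Summarize one partition as (n, column means, centered cross-products)."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # comoments[i][j] = sum over rows of (x_i - mean_i) * (x_j - mean_j)
    comoments = [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows)
                  for j in range(d)] for i in range(d)]
    return n, means, comoments

def merge_stats(a, b):
    """Combine two partition summaries into the summary of their union."""
    na, mean_a, ma = a
    nb, mean_b, mb = b
    n = na + nb
    d = len(mean_a)
    delta = [mean_b[j] - mean_a[j] for j in range(d)]
    means = [mean_a[j] + delta[j] * nb / n for j in range(d)]
    # The correction term accounts for the shift between the partition means.
    comoments = [[ma[i][j] + mb[i][j] + delta[i] * delta[j] * na * nb / n
                  for j in range(d)] for i in range(d)]
    return n, means, comoments

def correlation_matrix(stats):
    """Pearson correlation from a summary (divides by zero for constant columns)."""
    _, _, m = stats
    d = len(m)
    return [[m[i][j] / math.sqrt(m[i][i] * m[j][j]) for j in range(d)]
            for i in range(d)]

# Usage: summarize each partition independently, then reduce the summaries.
partitions = [
    [[1.0, 2.0, 3.0], [2.0, 4.0, 1.0]],
    [[3.0, 6.0, 2.0], [4.0, 8.0, 5.0], [5.0, 10.0, 4.0]],
]
corr = correlation_matrix(reduce(merge_stats, map(local_stats, partitions)))
```

Only the summaries (size proportional to the square of the number of columns) need to move between workers, never the raw rows. The merge is associative, so the reduce can be arranged in any order.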