Distributed cross correlation matrix computation

甜味超标 2021-01-12 02:37

How can I calculate the Pearson cross-correlation matrix of a large (>10 TB) data set, possibly in a distributed manner? Any efficient distributed algorithm suggestion will be ap

2 answers
  • 2021-01-12 03:26

    Each local data set can be converted into standard deviations and covariances. The standard deviations, covariances, and sums from all partitions can then be combined to produce the overall correlation.

    Here is a working example: https://github.com/jeesim2/distributed-correlation
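
As a rough illustration of the idea above (not the code from the linked repository), here is a minimal NumPy sketch. It assumes the data is split row-wise into partitions that each fit in memory and that all partitions share the same columns. The function names `local_stats`, `merge_stats`, `correlation_from_stats`, and `distributed_correlation` are hypothetical. Each partition is reduced to a row count, a mean vector, and a matrix of summed cross-products of deviations; two such summaries are merged with the standard pairwise update for means and co-moments, so only these small summaries need to travel between workers.

```python
import numpy as np
from functools import reduce


def local_stats(chunk):
    """Summarise one partition (rows = observations, columns = variables)."""
    chunk = np.asarray(chunk, dtype=np.float64)
    n = chunk.shape[0]
    mean = chunk.mean(axis=0)
    centered = chunk - mean
    # Sum over rows of the outer products of deviations from the local mean.
    comoment = centered.T @ centered
    return n, mean, comoment


def merge_stats(a, b):
    """Combine the summaries of two disjoint partitions into one summary."""
    n_a, mean_a, c_a = a
    n_b, mean_b, c_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * (n_b / n)
    # Correction term accounts for the difference between the two local means.
    comoment = c_a + c_b + np.outer(delta, delta) * (n_a * n_b / n)
    return n, mean, comoment


def correlation_from_stats(stats):
    """Turn a merged summary into a Pearson correlation matrix."""
    _, _, comoment = stats
    # The 1/(n-1) factors of covariance and variance cancel in the ratio.
    scale = np.sqrt(np.diag(comoment))
    return comoment / np.outer(scale, scale)


def distributed_correlation(chunks):
    """Map each partition to its summary, reduce by merging, then normalise."""
    merged = reduce(merge_stats, map(local_stats, chunks))
    return correlation_from_stats(merged)
```

Because `merge_stats` is associative, the same two functions could serve as the map and reduce steps of a MapReduce or MPI job. For a data set with p columns, each summary is p x p, so this sketch only suits cases where a p x p matrix fits on one machine.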

  • 2021-01-12 03:29

    To start with, have a look at this to see if things are going right. You may then refer to any of these implementations: MPI/OpenMP: Agomezl or Meismyles, MapReduce: Vangjee or Seawolf42. It'd also be interesting to read this before you proceed. On a different note, James's thesis provides some pointers if you're interested in computing the correlations that are robust to outliers.
