Which pyspark abstraction is appropriate for my large matrix multiplication?

Submitted by 扶醉桌前 on 2019-12-11 06:14:21

Question


I want to perform a large matrix multiplication C = A * B.T and then filter C by applying a stringent threshold, collecting a list of the form (row index, column index, value).

A and B are sparse, with mostly zero entries. They are initially represented as sparse scipy csr matrices.

Sizes of the matrices (when they are in dense format):
A: 9G (900,000 x 1200)
B: 6.75G (700,000 x 1200)
C, before thresholding: 5000G
C, after thresholding: 0.5G
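
For reference, the dense sizes above can be reproduced with a quick calculation. This is only a sketch and assumes 8-byte float64 entries and "G" meaning 10^9 bytes; neither is stated explicitly in the question, and the helper name is made up for illustration.

```python
# Assumption: dense float64 storage (8 bytes per entry), 1 G = 10**9 bytes.
BYTES_PER_ENTRY = 8


def dense_size_gb(rows, cols, bytes_per_entry=BYTES_PER_ENTRY):
    """Size in GB of a dense rows x cols matrix (hypothetical helper)."""
    return rows * cols * bytes_per_entry / 1e9


a_gb = dense_size_gb(900_000, 1200)     # A, roughly 9G
b_gb = dense_size_gb(700_000, 1200)     # B, roughly 6.75G
c_gb = dense_size_gb(900_000, 700_000)  # C = A * B.T, roughly 5000G

print(a_gb, b_gb, c_gb)
```

The dense product is about three orders of magnitude larger than either input, which is why C cannot be materialized in full before thresholding.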

Using pyspark, what strategy would you expect to be most efficient here? Which abstraction should I use to parallelize A and B? What else should I be thinking about to optimize the partition sizes?


Should I stick with my scipy sparse matrices objects and simply parallelize them into RDDs (perhaps with some custom serialization)?
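
To make this first option concrete, here is a minimal local sketch of what I have in mind: slice A into row blocks, multiply each block by B.T, and keep only the entries above the threshold. It uses scipy only; in Spark each row block would presumably become one RDD element with B broadcast to the executors, but that part is an assumption and is not shown. The function name, the block size, and the tiny matrices are made up for illustration.

```python
from scipy import sparse


def thresholded_product(A, B, threshold, block_rows=1000):
    """Return (row, col, value) triples of A @ B.T strictly above threshold.

    Only one row block of the product exists in memory at a time.
    """
    Bt = B.T.tocsr()
    triples = []
    for start in range(0, A.shape[0], block_rows):
        # Sparse product for rows [start, start + block_rows) of A.
        block = (A[start:start + block_rows] @ Bt).tocoo()
        keep = block.data > threshold
        triples.extend(zip((block.row[keep] + start).tolist(),
                           block.col[keep].tolist(),
                           block.data[keep].tolist()))
    return triples


# Tiny illustrative inputs (not my real data).
A_small = sparse.csr_matrix([[1.0, 0.0, 2.0],
                             [0.0, 0.0, 0.0],
                             [0.0, 3.0, 0.0]])
B_small = sparse.csr_matrix([[1.0, 0.0, 0.0],
                             [0.0, 1.0, 1.0],
                             [0.0, 2.0, 0.0]])
result = sorted(thresholded_product(A_small, B_small, threshold=1.5, block_rows=2))
print(result)
```

The block size controls the trade-off between per-task memory and the number of tasks, which is the partition-size question I am asking about.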

Should I store the non-zero entries of my A and B matrices using a DataFrame, then convert them to local pyspark matrix types when they are on the executors?

Should I use a DistributedMatrix abstraction from MLlib? For this strategy, I think I would first convert my scipy csr matrices to coo format, then create a pyspark CoordinateMatrix, then convert to either

  1. BlockMatrix? Dense representation but allows matrix multiplication w/ another distributed BlockMatrix.
  2. IndexedRowMatrix? Sparse representation but only allows matrix multiplication with a local matrix (e.g. a broadcast SparseMatrix?)
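
The conversion path for the BlockMatrix variant could look roughly like the sketch below. The helper names and block sizes are made up for illustration, `sc` is assumed to be an existing SparkContext, and building the full entry list on the driver is only reasonable for small inputs. The pyspark import is kept inside the function so that the scipy part can be tried without Spark installed.

```python
from scipy import sparse


def csr_to_entries(csr):
    """Convert a scipy csr matrix to a list of (row, col, value) tuples."""
    coo = csr.tocoo()
    return list(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))


def to_block_matrix(sc, csr, rows_per_block=1024, cols_per_block=1024):
    """Build a pyspark BlockMatrix from a scipy csr matrix (sketch only)."""
    from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

    entries = sc.parallelize(csr_to_entries(csr)).map(lambda t: MatrixEntry(*t))
    coord = CoordinateMatrix(entries, csr.shape[0], csr.shape[1])
    return coord.toBlockMatrix(rows_per_block, cols_per_block)


def thresholded_entries(sc, A, B, threshold):
    """Compute A * B.T as BlockMatrices and keep entries above threshold."""
    A_blocks = to_block_matrix(sc, A)
    B_blocks = to_block_matrix(sc, B)
    C = A_blocks.multiply(B_blocks.transpose())
    return C.toCoordinateMatrix().entries.filter(lambda e: e.value > threshold)


# The scipy-only part, on a tiny illustrative matrix.
entries_small = csr_to_entries(sparse.csr_matrix([[0.0, 5.0], [7.0, 0.0]]))
print(entries_small)
```

Both matrices use the same block sizes here because, as far as I understand, BlockMatrix.multiply requires the column block size of the left operand to match the row block size of the right operand.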

EDIT: Going through the docs, I was also happy to discover the IndexedRowMatrix method columnSimilarities(), which may be a good option when the goal is computing cosine similarity.


I am looking for a local solution for now. I have two machines available for prototyping: one with 16G RAM and 10 CPUs, the other with 64G RAM and 28 CPUs. I plan to run this on a cluster once I have a good prototype.

Source: https://stackoverflow.com/questions/56297660/which-pyspark-abstraction-is-appropriate-for-my-large-matrix-multiplication
