Question
I want to perform a large matrix multiplication C = A * B.T and then filter C by applying a stringent threshold, collecting a list of the form (row index, column index, value).
A and B are sparse, with mostly zero entries. They are initially represented as sparse scipy csr matrices.
Sizes of the matrices (when they are in dense format):
A: 9G (900,000 x 1200)
B: 6.75G (700,000 x 1200)
C, before thresholding: 5000G
C, after thresholding: 0.5G
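For concreteness, here is a rough single-machine sketch of the computation I want to distribute (the absolute-value >= threshold filter and the function name are just illustrations of the thresholding step):

```python
import numpy as np
# A and B are scipy.sparse.csr_matrix; `threshold` is the chosen cutoff.

def threshold_products(A, B, threshold):
    """Return (row, col, value) triples of C = A @ B.T with |value| >= threshold."""
    C = (A @ B.T).tocoo()                 # sparse product; COO exposes the triples directly
    keep = np.abs(C.data) >= threshold
    return list(zip(C.row[keep].tolist(), C.col[keep].tolist(), C.data[keep].tolist()))
```

Even with sparse inputs, the intermediate product is the part I do not expect to fit comfortably on one machine, which is why I want to break it up with Spark.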
Using pyspark, what strategy would you expect to be most efficient here? Which abstraction should I use to parallelize A and B? What else should I be thinking about to optimize the partition sizes?
Should I stick with my scipy sparse matrix objects and simply parallelize them into RDDs (perhaps with some custom serialization)?
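If I went this route, I imagine something like the sketch below: broadcast B (as CSR) to every executor, parallelize A as row blocks, and let each task do its block-by-B.T product with scipy. Here `sc`, `A`, `B` and `threshold` are assumed to already exist, and `block_size` / `multiply_block` are hypothetical names of my own:

```python
import numpy as np

def make_row_blocks(A, block_size=10000):
    """Yield (start_row, CSR block) pairs covering all rows of A."""
    for start in range(0, A.shape[0], block_size):
        yield start, A[start:start + block_size]

B_bc = sc.broadcast(B)  # assumes B in CSR form is small enough to broadcast

def multiply_block(item):
    start, block = item
    C = (block @ B_bc.value.T).tocoo()      # local scipy product for this row block
    keep = np.abs(C.data) >= threshold
    return [(int(start + i), int(j), float(v))
            for i, j, v in zip(C.row[keep], C.col[keep], C.data[keep])]

triples = (sc.parallelize(list(make_row_blocks(A)), numSlices=90)
             .flatMap(multiply_block))
```

The pickling of each scipy block is the "custom serialization" part I am unsure about.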
Should I store the non-zero entries of my A and B matrices using a DataFrame, then convert them to local pyspark matrix types when they are on the executors?
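A sketch of what I have in mind for the DataFrame option (the column names and the `spark` session variable are my own choices, not an established schema):

```python
A_coo = A.tocoo()
A_df = spark.createDataFrame(
    [(int(i), int(j), float(v)) for i, j, v in zip(A_coo.row, A_coo.col, A_coo.data)],
    ["row", "col", "value"],
)
# Each executor could then regroup its partition's rows back into a local
# scipy CSR block (or a pyspark.ml.linalg.SparseMatrix) before multiplying.
```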
Should I use a DistributedMatrix abstraction from MLlib? For this strategy, I think I would first convert my scipy CSR matrices to COO format, then create a pyspark CoordinateMatrix, then convert it to one of the following (a rough sketch of this route follows the list):
- BlockMatrix? Dense block representation, but it allows matrix multiplication with another distributed BlockMatrix.
- IndexedRowMatrix? Sparse representation, but it only allows matrix multiplication with a local matrix (e.g. a broadcast SparseMatrix?)
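Here is the rough sketch of the MLlib route I mean (the block sizes are just the defaults, and `sc` / `threshold` are assumed to be defined; I am not sure the dense blocks of the product will be manageable):

```python
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

def to_coordinate_matrix(sc, M):
    """scipy sparse matrix -> pyspark CoordinateMatrix via its COO entries."""
    coo = M.tocoo()
    entries = sc.parallelize(
        [MatrixEntry(int(i), int(j), float(v))
         for i, j, v in zip(coo.row, coo.col, coo.data)]
    )
    return CoordinateMatrix(entries, numRows=M.shape[0], numCols=M.shape[1])

A_block = to_coordinate_matrix(sc, A).toBlockMatrix(rowsPerBlock=1024, colsPerBlock=1024)
B_block = to_coordinate_matrix(sc, B).toBlockMatrix(rowsPerBlock=1024, colsPerBlock=1024)

# C = A * B.T; BlockMatrix.multiply densifies sparse blocks, so the full
# product is materialized before any thresholding can happen.
C = A_block.multiply(B_block.transpose()).toCoordinateMatrix()
hits = C.entries.filter(lambda e: abs(e.value) >= threshold)
```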
*EDIT: Going through the docs, I was also happy to discover the IndexedRowMatrix method columnSimilarities(), which may be a good option when the goal is computing cosine similarity.
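In case it helps frame the question, this is the kind of usage I have in mind; `rows_rdd` (an RDD of IndexedRow) and the 0.9 cutoff are placeholders. Note that columnSimilarities() compares the columns of a single distributed matrix, so A and B would still need to be arranged accordingly, and the threshold-aware DIMSUM variant lives on RowMatrix.columnSimilarities(threshold):

```python
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

mat = IndexedRowMatrix(rows_rdd)                       # rows_rdd: RDD[IndexedRow]
sims = mat.columnSimilarities()                        # CoordinateMatrix of cosine similarities
hits = sims.entries.filter(lambda e: e.value >= 0.9)   # illustrative cutoff only
```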
I am looking for a local solution for now. I have two machines available for prototyping: one with 16G RAM and 10 CPUs, the other with 64G RAM and 28 CPUs. I plan to run this on a cluster once I have a good prototype.
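For the prototype I was planning something like the following local-mode configuration on the 64G / 28-CPU box (the memory numbers are guesses that leave headroom for the OS and for the scipy objects on the driver):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[28]")
         .config("spark.driver.memory", "48g")
         .config("spark.driver.maxResultSize", "8g")
         .appName("sparse-matmul-prototype")
         .getOrCreate())
sc = spark.sparkContext
```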
Source: https://stackoverflow.com/questions/56297660/which-pyspark-abstraction-is-appropriate-for-my-large-matrix-multiplication