Question
I want to perform a large matrix multiplication C = A * B.T and then filter C by applying a stringent threshold, collecting a list of the form (row index, column index, value).
A and B are sparse, with mostly zero entries. They are initially represented as sparse scipy csr matrices.
Sizes of the matrices (when they are in dense format):
A: 9G (900,000 x 1200)
B: 6.75G (700,000 x 1200)
C, before thresholding: 5000G
C, after thresholding: 0.5G
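For concreteness, here is a rough single-machine sketch of the computation I want to distribute (the absolute-value >= threshold filter and the function name are just illustrations of the thresholding step):

```python
import numpy as np
# A and B are scipy.sparse.csr_matrix; `threshold` is the chosen cutoff.

def threshold_products(A, B, threshold):
    """Return (row, col, value) triples of C = A @ B.T with |value| >= threshold."""
    C = (A @ B.T).tocoo()                 # sparse product; COO exposes the triples directly
    keep = np.abs(C.data) >= threshold
    return list(zip(C.row[keep].tolist(), C.col[keep].tolist(), C.data[keep].tolist()))
```

Even with sparse inputs, the intermediate product is the part I do not expect to fit comfortably on one machine, which is why I want to break it up with Spark.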
Using pyspark, what strategy would you expect to be most efficient here? Which abstraction should I use to parallelize A and B? What else should I be thinking about to optimize the partition sizes?
Should I stick with my scipy sparse matrix objects and simply parallelize them into RDDs (perhaps with some custom serialization)?
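If I went this route, I imagine something like the sketch below: broadcast B (as CSR) to every executor, parallelize A as row blocks, and let each task do its block-by-B.T product with scipy. Here `sc`, `A`, `B` and `threshold` are assumed to already exist, and `block_size` / `multiply_block` are hypothetical names of my own:

```python
import numpy as np

def make_row_blocks(A, block_size=10000):
    """Yield (start_row, CSR block) pairs covering all rows of A."""
    for start in range(0, A.shape[0], block_size):
        yield start, A[start:start + block_size]

B_bc = sc.broadcast(B)  # assumes B in CSR form is small enough to broadcast

def multiply_block(item):
    start, block = item
    C = (block @ B_bc.value.T).tocoo()      # local scipy product for this row block
    keep = np.abs(C.data) >= threshold
    return [(int(start + i), int(j), float(v))
            for i, j, v in zip(C.row[keep], C.col[keep], C.data[keep])]

triples = (sc.parallelize(list(make_row_blocks(A)), numSlices=90)
             .flatMap(multiply_block))
```

The pickling of each scipy block is the "custom serialization" part I am unsure about.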
Should I store the non-zero entries of my A and B matrices using a DataFrame, then convert them to local pyspark matrix types when they are on the executors?
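A sketch of what I have in mind for the DataFrame option (the column names and the `spark` session variable are my own choices, not an established schema):

```python
A_coo = A.tocoo()
A_df = spark.createDataFrame(
    [(int(i), int(j), float(v)) for i, j, v in zip(A_coo.row, A_coo.col, A_coo.data)],
    ["row", "col", "value"],
)
# Each executor could then regroup its partition's rows back into a local
# scipy CSR block (or a pyspark.ml.linalg.SparseMatrix) before multiplying.
```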
Should I use a DistributedMatrix abstraction from MLlib? For this strategy, I think I would first convert my scipy CSR matrices to COO format, then create a pyspark CoordinateMatrix, then convert it to one of the following (a rough sketch of this route follows the list):
- BlockMatrix? Dense block representation, but it allows matrix multiplication with another distributed BlockMatrix.
- IndexedRowMatrix? Sparse representation, but it only allows matrix multiplication with a local matrix (e.g. a broadcast SparseMatrix?)
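Here is the rough sketch of the MLlib route I mean (the block sizes are just the defaults, and `sc` / `threshold` are assumed to be defined; I am not sure the dense blocks of the product will be manageable):

```python
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

def to_coordinate_matrix(sc, M):
    """scipy sparse matrix -> pyspark CoordinateMatrix via its COO entries."""
    coo = M.tocoo()
    entries = sc.parallelize(
        [MatrixEntry(int(i), int(j), float(v))
         for i, j, v in zip(coo.row, coo.col, coo.data)]
    )
    return CoordinateMatrix(entries, numRows=M.shape[0], numCols=M.shape[1])

A_block = to_coordinate_matrix(sc, A).toBlockMatrix(rowsPerBlock=1024, colsPerBlock=1024)
B_block = to_coordinate_matrix(sc, B).toBlockMatrix(rowsPerBlock=1024, colsPerBlock=1024)

# C = A * B.T; BlockMatrix.multiply densifies sparse blocks, so the full
# product is materialized before any thresholding can happen.
C = A_block.multiply(B_block.transpose()).toCoordinateMatrix()
hits = C.entries.filter(lambda e: abs(e.value) >= threshold)
```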
*EDIT: Going through the docs, I was also happy to discover the IndexedRowMatrix method columnSimilarities(), which may be a good option when the goal is computing cosine similarity.
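In case it helps frame the question, this is the kind of usage I have in mind; `rows_rdd` (an RDD of IndexedRow) and the 0.9 cutoff are placeholders. Note that columnSimilarities() compares the columns of a single distributed matrix, so A and B would still need to be arranged accordingly, and the threshold-aware DIMSUM variant lives on RowMatrix.columnSimilarities(threshold):

```python
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

mat = IndexedRowMatrix(rows_rdd)                       # rows_rdd: RDD[IndexedRow]
sims = mat.columnSimilarities()                        # CoordinateMatrix of cosine similarities
hits = sims.entries.filter(lambda e: e.value >= 0.9)   # illustrative cutoff only
```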
I am looking for a local solution for now. I have two machines available for prototyping: one with 16G RAM and 10 CPUs, the other with 64G RAM and 28 CPUs. I plan to run this on a cluster once I have a good prototype.
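For the prototype I was planning something like the following local-mode configuration on the 64G / 28-CPU box (the memory numbers are guesses that leave headroom for the OS and for the scipy objects on the driver):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[28]")
         .config("spark.driver.memory", "48g")
         .config("spark.driver.maxResultSize", "8g")
         .appName("sparse-matmul-prototype")
         .getOrCreate())
sc = spark.sparkContext
```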
Source: https://stackoverflow.com/questions/56297660/which-pyspark-abstraction-is-appropriate-for-my-large-matrix-multiplication