Question
The code below causes my system to run out of memory before it completes.
Can you suggest a more efficient means of computing the cosine similarity on a large matrix, such as the one below?
I would like to have the cosine similarity computed for each of the 65000 rows in my original matrix (mat) relative to all of the others, so that the result is a 65000 x 65000 matrix where each element is the cosine similarity between two rows in the original matrix.
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
mat = np.random.rand(65000, 10)
sparse_mat = sparse.csr_matrix(mat)
similarities = cosine_similarity(sparse_mat)
After running that last line I always run out of memory and the program either freezes or crashes with a MemoryError. This occurs whether I run with my 8 GB of local RAM or on a 64 GB EC2 instance.
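For scale, the dense float64 result alone takes close to half of the 64 GB machine's memory before any intermediate copies are made:

>>> 65000 * 65000 * 8 / 1024**3   # GiB for the dense float64 output: ~31.5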
Answer 1:
Same problem here. I've got a big, non-sparse matrix. It fits in memory just fine, but cosine_similarity crashes for whatever unknown reason, probably because it copies the matrix one time too many somewhere. So I made it compare small batches of rows "on the left" instead of the entire matrix:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarity_n_space(m1, m2, batch_size=100):
    assert m1.shape[1] == m2.shape[1]
    ret = np.ndarray((m1.shape[0], m2.shape[0]))
    for row_i in range(0, int(m1.shape[0] / batch_size) + 1):
        start = row_i * batch_size
        end = min([(row_i + 1) * batch_size, m1.shape[0]])
        if end <= start:
            break  # cause I'm too lazy to elegantly handle edge cases
        rows = m1[start: end]
        sim = cosine_similarity(rows, m2)  # rows is O(1) size
        ret[start: end] = sim
    return ret
No crashes for me; YMMV. Try different batch sizes to make it faster. I used to only compare 1 row at a time, and it took about 30X as long on my machine.
Stupid yet effective sanity check:
import random

while True:
    m = np.random.rand(random.randint(1, 100), random.randint(1, 100))
    n = np.random.rand(random.randint(1, 100), m.shape[1])
    assert np.allclose(cosine_similarity(m, n), cosine_similarity_n_space(m, n))
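Note that ret in cosine_similarity_n_space still holds the entire output in RAM, which at 65000 x 65000 float64 is the ~31 GiB that was blowing up in the first place. If even the batched version doesn't fit, you can back the output with a disk-based np.memmap instead; a sketch along the same lines (the float32 dtype and output file are my choices here):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarity_n_space_memmap(m1, m2, out_path, batch_size=100):
    assert m1.shape[1] == m2.shape[1]
    # Disk-backed output array; float32 also halves the footprint vs. float64
    ret = np.memmap(out_path, dtype=np.float32, mode='w+',
                    shape=(m1.shape[0], m2.shape[0]))
    for start in range(0, m1.shape[0], batch_size):
        end = min(start + batch_size, m1.shape[0])
        ret[start:end] = cosine_similarity(m1[start:end], m2)
    ret.flush()  # push cached pages out to the file
    return ret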
Answer 2:
You're running out of memory because you're trying to store a 65000 x 65000 matrix. Note that the matrix you're constructing is not sparse at all. np.random.rand generates uniform random floats in [0, 1), so there aren't enough zeros for csr_matrix to actually compress your data. In fact, there are almost surely no zeros at all.
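You can check this directly with the question's mat and sparse_mat:

>>> np.count_nonzero(mat == 0)   # the odds of even one exact zero are negligible
0
>>> sparse_mat.nnz               # CSR ends up storing all 65000 * 10 values
650000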
If you look closely at your MemoryError traceback, you can see that cosine_similarity tries to use the sparse dot product if possible:
MemoryError Traceback (most recent call last)
887 Y_normalized = normalize(Y, copy=True)
888
--> 889 K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
890
891 return K
So the problem isn't with cosine_similarity, it's with your matrix. Try initializing an actual sparse matrix (with only 1% of entries nonzero, for example) like this:
>>> a = np.zeros((65000, 10))
>>> mask = np.random.rand(a.size) < 0.01  # boolean mask picking ~1% of entries
>>> a.flat[mask] = 1
>>> a = sparse.csr_matrix(a)
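As an aside, scipy can generate a random sparse matrix directly, skipping the dense intermediate (values are uniform in [0, 1) by default):

>>> a = sparse.random(65000, 10, density=0.01, format='csr')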
Then, on a machine with 32 GB of RAM (8 GB was not enough for me), the following runs with no memory error:
>>> b = cosine_similarity(a)
>>> b
array([[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 1., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.]])
Answer 3:
I would run it in chunks like this:
from sklearn.metrics.pairwise import cosine_similarity

# Change chunk_size to control the memory/speed trade-off:
# a higher chunk_size needs more RAM but runs faster
chunk_size = 500
matrix_len = your_matrix.shape[0]  # your_matrix is a dense numpy.ndarray

def similarity_cosine_by_chunk(start, end):
    if end > matrix_len:
        end = matrix_len
    return cosine_similarity(X=your_matrix[start:end], Y=your_matrix)

for chunk_start in range(0, matrix_len, chunk_size):
    cosine_similarity_chunk = similarity_cosine_by_chunk(chunk_start, chunk_start + chunk_size)
    # Handle cosine_similarity_chunk here: write it to a file and close that
    # file before the next iteration, or you may run out of memory after a
    # few chunks
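For the "handle" step, writing each chunk to its own .npy file keeps peak memory at a single chunk; a minimal sketch (the per-chunk filenames are just an illustration, not part of the original answer):

import numpy as np

for chunk_start in range(0, matrix_len, chunk_size):
    chunk = similarity_cosine_by_chunk(chunk_start, chunk_start + chunk_size)
    # np.save opens, writes, and closes the file itself, so nothing
    # accumulates across iterations
    np.save('similarities_%d.npy' % chunk_start, chunk)

The chunks can later be read back one at a time with np.load, or memory-mapped via np.load(path, mmap_mode='r').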
Source: https://stackoverflow.com/questions/40900608/cosine-similarity-on-large-sparse-matrix-with-numpy