Question
I was calculating a cosine similarity matrix for sparse vectors, and the elements that should have been floats came out as 'nan'.
'visits' is a sparse matrix showing how many times each user has visited each website. It was originally a dense matrix of shape 1 500 000 x 1500, which I converted to a sparse matrix using coo_matrix().tocsc().
The task is to find out how similar the websites are, so I decided to calculate the cosine metric between every pair of sites.
Here is my code:
cosine_distance_matrix = np.ndarray(shape = (visits.shape[1], visits.shape[1]))

def norm(x):
    return np.sqrt(
        x.T.dot(x)
    )

for i in range(0, visits.shape[1]):
    for k in range(0, i + 1):
        normi_normk = norm(visits[:,i]) * norm(visits[:,k])
        cosine_distance_matrix[i,k] = visits[:,i].T.dot(visits[:, k])/normi_normk
        cosine_distance_matrix[k, i] = cosine_distance_matrix[i, k]

print(cosine_distance_matrix)
And this is what I have gotten! O_o
[[ 1. nan nan ..., nan nan nan]
[ nan 1. nan ..., nan nan nan]
[ nan nan 1. ..., nan nan nan]
...,
[ nan nan nan ..., 1. nan nan]
[ nan nan nan ..., nan 1. nan]
[ nan nan nan ..., nan nan 1.]]
This program ran for 3 hours... Why am I getting nan instead of float numbers?
Answer 1:
Try:
def norm(x):
    return np.sqrt((x.T*x).A)
I constructed a smaller sample visits matrix, and calculated cosine_distance_matrix with your code. Mine had the diagonal of 1s, and lots of nan on the off diagonal. I chose one of the nan items, and looked at the corresponding i,k calculation.
In [690]: normi_normk = norm(visits[:,i]) * norm(visits[:,k])
In [691]: normi_normk
Out[691]:
<1x1 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Column format>
In [692]: normi_normk.A
Out[692]: array([[ 18707.57953344]])
visits is a sparse matrix, so visits[:,i] is also a sparse matrix (1 column). Your norm function returns a 1x1 sparse matrix.
For this pair, the dot product is 0, but it is still a 1x1 sparse matrix:
In [718]: visits[:,i].T.dot(visits[:, k])
Out[718]:
<1x1 sparse matrix of type '<class 'numpy.int32'>'
with 0 stored elements in Compressed Sparse Column format>
Dividing one of these sparse matrices by the other gives nan:
In [717]: visits[:,i].T.dot(visits[:, k])/normi_normk
Out[717]: matrix([[ nan]])
But if I change normi_normk to a scalar or dense array I get 0:
In [722]: visits[:,i].T.dot(visits[:, k])/normi_normk.A
Out[722]: matrix([[ 0.]])
So we have to change this from a matrix/matrix division to something involving dense arrays or scalars. It can be changed in various ways. Rewriting norm to handle sparse matrices correctly is one.
In addition I'd suggest using:
(visits[:,i].T*visits[:, k]).A/normi_normk
so that both terms of the division are dense.
Another possibility is to use visits[:,i].A and visits[:,k].A, so the inner loop calculations are done with dense arrays rather than these matrices.
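Putting these suggestions together, here is a minimal sketch of the loop with every intermediate result reduced to a plain scalar before dividing. The toy visits data and the helper name column_norm are made up for illustration, and .toarray() is used as the spelled-out form of .A; treat it as one possible arrangement, not the only fix.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy data: 4 users (rows) x 3 sites (columns).
dense = np.array([[1, 0, 2],
                  [0, 3, 0],
                  [4, 0, 0],
                  [0, 5, 0]], dtype=float)
visits = coo_matrix(dense).tocsc()

def column_norm(col):
    # col is an (n, 1) sparse column. The 1x1 sparse product is turned
    # into a dense array and reduced to a float before the square root.
    return float(np.sqrt(col.T.dot(col).toarray()[0, 0]))

n_sites = visits.shape[1]
cosine_distance_matrix = np.zeros((n_sites, n_sites))

for i in range(n_sites):
    col_i = visits[:, i]
    for k in range(i + 1):
        col_k = visits[:, k]
        # Scalar numerator and scalar denominator: no sparse/sparse division.
        dot_ik = col_i.T.dot(col_k).toarray()[0, 0]
        normi_normk = column_norm(col_i) * column_norm(col_k)
        cosine_distance_matrix[i, k] = dot_ik / normi_normk
        cosine_distance_matrix[k, i] = cosine_distance_matrix[i, k]

print(cosine_distance_matrix)
```

Note that a site with no visits at all has norm 0, so this sketch would still produce nan for that column; real data may need a guard for that case.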
Note that I'm not doing anything advanced or special. I just examined in detail one of the problem calculations, and found the source of the nan.
I would also suggest using np.zeros to initialize the array. I only use ndarray when the normal zeros, ones, empty don't work.
cosine_distance_matrix = np.zeros((visits.shape[1], visits.shape[1]))
In the big picture it would be best to avoid looping over i and k, doing everything with matrix products and such. But this fix will get you going.
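As a rough illustration of the loop-free approach, the sketch below scales each column to unit length and then takes a single sparse matrix product. The toy data and the variable names (col_norms, inv_norms, normalized, similarity) are assumptions for the example, as is the choice to leave all-zero columns at similarity 0 rather than nan.

```python
import numpy as np
from scipy.sparse import coo_matrix, diags

# Hypothetical toy data: 4 users (rows) x 4 sites (columns).
# The last site has no visits, to exercise the zero-norm guard.
dense = np.array([[1, 0, 2, 0],
                  [0, 3, 0, 0],
                  [4, 0, 0, 0],
                  [0, 5, 0, 0]], dtype=float)
visits = coo_matrix(dense).tocsc()

# Euclidean norm of every column: square elementwise, sum over rows.
col_norms = np.sqrt(np.asarray(visits.multiply(visits).sum(axis=0)).ravel())

# Reciprocal norms, leaving 0 where a column is entirely zero
# (assumption: such sites get similarity 0 instead of nan).
inv_norms = np.zeros_like(col_norms)
nonzero = col_norms > 0
inv_norms[nonzero] = 1.0 / col_norms[nonzero]

# Right-multiplying by a diagonal matrix rescales each column.
normalized = visits.dot(diags(inv_norms))

# All pairwise cosines in one sparse product, then a dense result.
similarity = normalized.T.dot(normalized).toarray()

print(similarity)
```

For the 1500 sites in the question the dense result would be a 1500 x 1500 float64 array, about 18 MB, while the large users-by-sites matrix stays sparse throughout.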
Source: https://stackoverflow.com/questions/33651788/cosine-similarity-yields-nan-values