Cosine similarity yields 'nan' values

Question

I was calculating a cosine similarity matrix for sparse vectors, and the elements I expected to be floats came out as 'nan'.

'visits' is a sparse matrix showing how many times each user has visited each website. The matrix has shape 1 500 000 x 1500, and I converted it to a sparse matrix using coo_matrix().tocsc().

The task is to find out how similar the websites are, so I decided to calculate the cosine similarity between every pair of sites.

Here is my code:

cosine_distance_matrix = np.ndarray(shape = (visits.shape[1], visits.shape[1]))

def norm(x):
    return np.sqrt(
        x.T.dot(x)
    )

for i in range(0, visits.shape[1]):
  for k in range(0, i + 1):
    normi_normk = norm(visits[:,i]) * norm(visits[:,k])
    cosine_distance_matrix[i,k] = visits[:,i].T.dot(visits[:, k])/normi_normk
    cosine_distance_matrix[k, i] = cosine_distance_matrix[i, k]

print(cosine_distance_matrix)

And this is what I got:

[[  1.  nan  nan ...,  nan  nan  nan]
 [ nan   1.  nan ...,  nan  nan  nan]
 [ nan  nan   1. ...,  nan  nan  nan]
 ..., 
 [ nan  nan  nan ...,   1.  nan  nan]
 [ nan  nan  nan ...,  nan   1.  nan]
 [ nan  nan  nan ...,  nan  nan   1.]]

The program ran for 3 hours. Why am I getting nan instead of float values?

Answer 1:

Try:

def norm(x):
    return np.sqrt((x.T*x).A)

I constructed a smaller sample visits matrix and calculated cosine_distance_matrix with your code. Mine had a diagonal of 1s and lots of nan off the diagonal. I chose one of the nan items and looked at the corresponding i,k calculation.

In [690]: normi_normk = norm(visits[:,i]) * norm(visits[:,k])
In [691]: normi_normk
Out[691]: 
<1x1 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in Compressed Sparse Column format>
In [692]: normi_normk.A
Out[692]: array([[ 18707.57953344]])

visits is a sparse matrix, so visits[:,i] is also a sparse matrix (one column). Your norm function returns a 1x1 sparse matrix.

For this pair the dot product is 0, but it is still a 1x1 sparse matrix:

In [718]: visits[:,i].T.dot(visits[:, k])
Out[718]: 
<1x1 sparse matrix of type '<class 'numpy.int32'>'
    with 0 stored elements in Compressed Sparse Column format>
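If you want to check this on your own data, here is a minimal sketch. The 4x3 visits matrix is made-up toy data, not your matrix, and I use .toarray() instead of the .A shorthand because, as far as I know, recent SciPy releases no longer provide .A on sparse matrices.

```python
import numpy as np
from scipy import sparse

# Made-up toy data for illustration: 4 users x 3 sites.
visits = sparse.csc_matrix(np.array([
    [1, 0, 2],
    [0, 3, 0],
    [4, 0, 0],
    [0, 5, 0],
]))

# A column slice of a sparse matrix is still sparse, with shape (4, 1).
col0 = visits[:, 0]
print(sparse.issparse(col0), col0.shape)

# The dot product of two columns is a 1x1 sparse matrix, not a number.
dot01 = col0.T.dot(visits[:, 1])
print(sparse.issparse(dot01), dot01.shape)

# Convert to a dense array and index it to get an ordinary scalar.
dot01_scalar = dot01.toarray()[0, 0]
print(dot01_scalar)
```

Columns 0 and 1 share no users in this toy matrix, so the scalar is 0, the same situation as the pair above.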

Dividing one of these sparse matrices by the other gives nan:

In [717]: visits[:,i].T.dot(visits[:, k])/normi_normk
Out[717]: matrix([[ nan]])

But if I change normi_normk to a scalar or dense array, I get 0:

In [722]: visits[:,i].T.dot(visits[:, k])/normi_normk.A
Out[722]: matrix([[ 0.]])

So we have to change this from a matrix/matrix division to something involving dense arrays or scalars. That can be done in various ways; rewriting norm to handle sparse matrices correctly is one.

In addition, I'd suggest using:

(visits[:,i].T*visits[:, k]).A/normi_normk

so that both terms of the division are dense.

Another possibility is to use visits[:,i].A and visits[:,k].A, so the inner loop calculations are done with dense arrays rather than these matrices.
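Putting the suggestions above together, the loop could look roughly like the sketch below. It is only an illustration on made-up toy data (the 4x3 visits matrix is mine), and it pulls out plain scalars with .toarray()[0, 0] rather than .A so that every division is between ordinary floats.

```python
import numpy as np
from scipy import sparse


def norm(x):
    # x is an (n, 1) sparse column; return a plain float, not a 1x1 matrix.
    return float(np.sqrt(x.T.dot(x).toarray()[0, 0]))


# Made-up toy data for illustration: 4 users x 3 sites.
visits = sparse.csc_matrix(np.array([
    [1, 0, 2],
    [0, 3, 0],
    [4, 0, 0],
    [0, 5, 0],
], dtype=float))

n_sites = visits.shape[1]
cosine_distance_matrix = np.zeros((n_sites, n_sites))

for i in range(n_sites):
    for k in range(i + 1):
        # Scalar dot product of the two columns.
        dot_ik = visits[:, i].T.dot(visits[:, k]).toarray()[0, 0]
        # Scalar / scalar division, so no sparse-matrix division is involved.
        cosine_distance_matrix[i, k] = dot_ik / (norm(visits[:, i]) * norm(visits[:, k]))
        cosine_distance_matrix[k, i] = cosine_distance_matrix[i, k]

print(cosine_distance_matrix)
```

One caveat: a site with no visits at all has a zero norm, so its row and column would still come out as nan from the 0/0 division. This sketch does not handle that case.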

Note that I'm not doing anything advanced or special. I just examined in detail one of the problem calculations, and found the source of the nan.

I would also suggest using np.zeros to initialize the array, since np.ndarray(shape=...) leaves the memory uninitialized. I only call np.ndarray directly when the usual np.zeros, np.ones, or np.empty won't do.

cosine_distance_matrix = np.zeros((visits.shape[1], visits.shape[1]))

In the big picture it would be best to avoid looping over i and k and to do everything with matrix products instead. But this fix will get you going.
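For reference, a loop-free version might look something like the sketch below. It is not tested on your data: the toy visits matrix is made up, and treating a site with no visits as having norm 1 (so its similarities come out as 0 instead of nan) is my own assumption about what you would want.

```python
import numpy as np
from scipy import sparse

# Made-up toy data for illustration: 4 users x 3 sites.
visits = sparse.csc_matrix(np.array([
    [1, 0, 2],
    [0, 3, 0],
    [4, 0, 0],
    [0, 5, 0],
], dtype=float))

# Column norms: square root of the sum of squared entries in each column.
col_norms = np.sqrt(np.asarray(visits.multiply(visits).sum(axis=0))).ravel()

# Assumption: replace zero norms by 1 so empty columns give 0 rather than nan.
safe_norms = np.where(col_norms > 0, col_norms, 1.0)

# Scale every column to unit length by multiplying with a diagonal matrix.
normalized = visits.dot(sparse.diags(1.0 / safe_norms))

# One sparse product now yields all pairwise cosine similarities at once.
cosine_distance_matrix = normalized.T.dot(normalized).toarray()

print(cosine_distance_matrix)
```

With 1500 sites the dense 1500 x 1500 result is small, and the work is done in a single sparse product instead of about a million Python-level iterations.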

Source: https://stackoverflow.com/questions/33651788/cosine-similarity-yields-nan-values

Tags

python

numpy

sparse-matrix

similarity

cosine-similarity