Efficient computation of similarity matrix in Python (NumPy)

Question


Let X be a B×n NumPy array, i.e.,

import numpy as np
B = 10
n = 2
X = np.random.random((B, n))

Now, I'm interested in computing the so-called kernel (or similarity) matrix K, which is of shape B×B and whose (i,j)-th element is given as follows:

K(i,j) = fun(x_i, x_j)

where x_t denotes the t-th row of the matrix X, and fun is some function of x_i and x_j. For instance, this function could be the RBF kernel, i.e.,

K(i,j) = exp(-||x_i - x_j||^2).

For doing so, a naive way would be the following:

K = np.zeros((B, B))
for i in range(X.shape[0]):
    x_i = X[i, :]
    for j in range(X.shape[0]):
        x_j = X[j, :]
        # RBF kernel entry: exp(-||x_i - x_j||^2)
        K[i, j] = np.exp(-np.linalg.norm(x_i - x_j, 2) ** 2)

What I want is to do the above operation in a vectorized way, for the sake of efficiency. Could you help?


Answer 1:


I'm not sure you can do this using only NumPy. I would use the cdist function from the SciPy library, something like this:

import numpy as np
from scipy.spatial.distance import cdist

B = 5
X = np.random.rand(B * B).reshape((B, B))
# Pairwise Euclidean distances between the rows of X
dist = cdist(X, X, metric='euclidean')
# RBF kernel: note the negated *squared* distances, per the question
K = np.exp(-dist ** 2)

dist
array([[ 0.        ,  1.2659804 ,  0.98231231,  0.80089176,  1.19326493],
       [ 1.2659804 ,  0.        ,  0.72658078,  0.80618767,  0.3776364 ],
       [ 0.98231231,  0.72658078,  0.        ,  0.70205336,  0.81352455],
       [ 0.80089176,  0.80618767,  0.70205336,  0.        ,  0.60025858],
       [ 1.19326493,  0.3776364 ,  0.81352455,  0.60025858,  0.        ]])
Applying np.exp(-dist ** 2) elementwise then yields the kernel matrix K, whose diagonal is all ones (each row's distance to itself is zero).
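As a small variant (my own addition, not in the original answer), you can ask cdist for squared distances directly via the 'sqeuclidean' metric and skip the manual squaring:

# Squared Euclidean distances in one call, then the RBF kernel
d2 = cdist(X, X, metric='sqeuclidean')
K = np.exp(-d2)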

Hope this helps. Good luck!

EDIT: You can also do this with NumPy arrays alone; the same trick carries over to a Theano implementation:

# ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 * <x_i, x_j>, expanded for all pairs at once
dist = (X ** 2).sum(1).reshape((X.shape[0], 1)) + (X ** 2).sum(1).reshape((1, X.shape[0])) - 2 * X.dot(X.T)

It should work!
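One caveat worth adding: because of floating-point round-off, this expansion can produce tiny negative entries on or near the diagonal, so it is safer to clip before exponentiating. A minimal sketch, assuming X is defined as in the question:

import numpy as np

sq_norms = (X ** 2).sum(axis=1)
# dist holds *squared* pairwise distances
dist = sq_norms[:, None] + sq_norms[None, :] - 2 * X.dot(X.T)
# Round-off can make some entries slightly negative; clip them to zero
dist = np.maximum(dist, 0)
K = np.exp(-dist)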




Answer 2:


This is certainly possible in numpy alone if you harness the power of broadcasting.

You just have to code out the inner distance-norm calculation in a vectorized way:

X1 = X[:, np.newaxis, :]  # shape (B, 1, n)
X2 = X[np.newaxis, :, :]  # shape (1, B, n)
# Broadcasting yields a (B, B, n) difference array; sum the squares over the last axis
K = np.exp(-np.sum((X1 - X2) ** 2, axis=-1))
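As a quick sanity check (a sketch of my own, not part of the original answer), the broadcast result should match the question's double loop up to floating-point tolerance:

# Naive reference from the question, for comparison
B = X.shape[0]
K_loop = np.zeros((B, B))
for i in range(B):
    for j in range(B):
        K_loop[i, j] = np.exp(-np.linalg.norm(X[i] - X[j], 2) ** 2)
assert np.allclose(K, K_loop)

Note that the broadcast difference materializes a (B, B, n) intermediate array, so memory grows as B^2 * n; for small n this is usually fine.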



Answer 3:


Don't vectorize it, just compile it

This is almost always faster, and the code is easier to read. Since a good JIT compiler like Numba is available, it is a really simple thing to do.

In your case:

import numpy as np
import numba as nb

@nb.njit(fastmath=True)
def Test_1(X):
    B = X.shape[0]  # take the size from X rather than relying on a global
    K = np.zeros((B, B))
    for i in range(B):
        x_i = X[i, :]
        for j in range(B):
            x_j = X[j, :]
            K[i, j] = np.exp(-np.linalg.norm(x_i - x_j, 2) ** 2)
    return K

It is also very easy to parallelize functions:

import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def Test_2(X):
    B = X.shape[0]
    K = np.zeros((B, B))
    # prange distributes the outer loop across threads
    for i in nb.prange(B):
        x_i = X[i, :]
        for j in range(B):
            x_j = X[j, :]
            K[i, j] = np.exp(-np.linalg.norm(x_i - x_j, 2) ** 2)
    return K

This easily outperforms all the other solutions provided so far. The first function call takes about 0.5 s longer, because that is when the code is compiled, but I guess you want to call this function more than once.

If you use the single-threaded version, you can also cache the compilation results (pass cache=True to nb.njit). Caching of multithreaded code will likely be implemented soon.
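A minimal usage sketch (the input size here is my own choice, not from the answer), assuming the snippet above has been run: call the function once to trigger compilation, then time the compiled call:

import time

X = np.random.random((1000, 2))
Test_2(X)  # first call compiles the function

t0 = time.perf_counter()
K = Test_2(X)  # subsequent calls run the compiled machine code
print(time.perf_counter() - t0)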



Source: https://stackoverflow.com/questions/48907216/efficient-computation-of-similarity-matrix-in-python-numpy
