Question
Let X be a B x n numpy matrix, i.e.,
import numpy as np
B = 10
n = 2
X = np.random.random((B, n))
Now, I'm interested in computing the so-called kernel (or similarity) matrix K, which is of shape B x B, and whose {i,j}-th element is given as follows:
K(i,j) = fun(x_i, x_j)
where x_t denotes the t-th row of matrix X and fun is some function of x_i and x_j. For instance, this function could be the so-called RBF function, i.e.,
K(i,j) = exp(-||x_i - x_j||^2).
To do so, a naive way would be the following:
K = np.zeros((B, B))
for i in range(X.shape[0]):
    x_i = X[i, :]
    for j in range(X.shape[0]):
        x_j = X[j, :]
        K[i, j] = np.exp(-np.linalg.norm(x_i - x_j, 2) ** 2)
What I want is to do the above operation in a vectorized way, for the sake of efficiency. Could you help?
Answer 1:
I'm not sure that you can do this using only numpy. I would use the method cdist from the scipy library, something like this:
import numpy as np
from scipy.spatial.distance import cdist
B = 5
X = np.random.rand(B * B).reshape((B, B))
dist = cdist(X, X, metric='euclidean')
K = np.exp(dist)  # note: for the RBF kernel from the question, this would be np.exp(-dist ** 2)
dist
array([[ 0. , 1.2659804 , 0.98231231, 0.80089176, 1.19326493],
[ 1.2659804 , 0. , 0.72658078, 0.80618767, 0.3776364 ],
[ 0.98231231, 0.72658078, 0. , 0.70205336, 0.81352455],
[ 0.80089176, 0.80618767, 0.70205336, 0. , 0.60025858],
[ 1.19326493, 0.3776364 , 0.81352455, 0.60025858, 0. ]])
K
array([[ 1. , 3.5465681 , 2.67062441, 2.22752646, 3.29783084],
[ 3.5465681 , 1. , 2.06799756, 2.23935453, 1.45883242],
[ 2.67062441, 2.06799756, 1. , 2.01789192, 2.25584482],
[ 2.22752646, 2.23935453, 2.01789192, 1. , 1.82259002],
[ 3.29783084, 1.45883242, 2.25584482, 1.82259002, 1. ]])
Hope this helps. Good luck!
EDIT: You can also compute the squared pairwise distances using only numpy arrays (e.g., for a Theano implementation):
dist = (X ** 2).sum(1).reshape((X.shape[0], 1)) + (X ** 2).sum(1).reshape((1, X.shape[0])) - 2 * X.dot(X.T)
It should work!
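For completeness, here is a minimal numpy-only sketch of the full RBF kernel built from that identity (the expression above is the squared distance, since ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i . x_j, so it only remains to negate and exponentiate; the clip guards against tiny negative values caused by floating-point round-off):
import numpy as np

B, n = 10, 2
X = np.random.random((B, n))

sq_norms = (X ** 2).sum(1)                       # ||x_i||^2 for each row
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X.dot(X.T)
sq_dists = np.clip(sq_dists, 0, None)            # round-off can make entries slightly negative
K = np.exp(-sq_dists)                            # RBF kernel as defined in the question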
Answer 2:
This is certainly possible in numpy alone if you harness the power of broadcasting.
You just have to code out the inner distance-norm calculation in a vectorized way:
X1 = X[:, np.newaxis, :]  # shape (B, 1, n)
X2 = X[np.newaxis, :, :]  # shape (1, B, n)
K = np.exp(-np.sum((X1 - X2)**2, axis=-1))  # (X1 - X2) broadcasts to (B, B, n); summing over the last axis gives (B, B)
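As a quick self-contained sanity check (a sketch comparing against the question's naive double loop), the broadcast result matches to floating-point precision. One design caveat: the intermediate X1 - X2 materializes a (B, B, n) array, so memory grows quickly for large B:
import numpy as np

B, n = 10, 2
X = np.random.random((B, n))

# naive loop from the question, for comparison
K_naive = np.zeros((B, B))
for i in range(B):
    for j in range(B):
        K_naive[i, j] = np.exp(-np.linalg.norm(X[i] - X[j]) ** 2)

# broadcast version
K = np.exp(-np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1))
assert np.allclose(K, K_naive)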
Answer 3:
Don't vectorize it, just compile it
This is almost always faster, and the code is easier to read. Since a good JIT compiler like Numba is available, this is a really simple thing to do.
In your case:
import numpy as np
import numba as nb
@nb.njit(fastmath=True)
def Test_1(X):
    K = np.zeros((X.shape[0], X.shape[0]))  # use X.shape[0] instead of relying on the global B
    for i in range(X.shape[0]):
        x_i = X[i, :]
        for j in range(X.shape[0]):
            x_j = X[j, :]
            K[i, j] = np.exp(-np.linalg.norm(x_i - x_j, 2) ** 2)
    return K
It is also very easy to parallelize functions:
import numpy as np
import numba as nb
@nb.njit(fastmath=True, parallel=True)
def Test_1(X):
    K = np.zeros((X.shape[0], X.shape[0]))
    for i in nb.prange(X.shape[0]):  # the outer loop runs across threads
        x_i = X[i, :]
        for j in range(X.shape[0]):
            x_j = X[j, :]
            K[i, j] = np.exp(-np.linalg.norm(x_i - x_j, 2) ** 2)
    return K
This easily outperforms all other solutions provided so far. The first function call takes about 0.5 s longer, because that is when the code is compiled, but I guess you want to call this function more than once.
If you use the single-threaded version, you can also cache the compilation results. Caching of multithreaded code will likely be implemented soon.
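For reference, a minimal sketch of that caching on the single-threaded version, using Numba's cache flag (cache=True persists the compiled function to disk so later sessions skip recompilation; the name Test_1_cached is just illustrative):
import numpy as np
import numba as nb

@nb.njit(fastmath=True, cache=True)  # cache=True: compiled code is reused across sessions
def Test_1_cached(X):
    K = np.zeros((X.shape[0], X.shape[0]))
    for i in range(X.shape[0]):
        for j in range(X.shape[0]):
            K[i, j] = np.exp(-np.linalg.norm(X[i, :] - X[j, :], 2) ** 2)
    return K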
Source: https://stackoverflow.com/questions/48907216/efficient-computation-of-similarity-matrix-in-python-numpy