Question
I have written code using numpy that takes an array of size (m x n)... The rows (m) are individual observations comprised of (n) features... and creates a square distance matrix of size (m x m). This distance matrix is the distance of a given observation from all other observations. E.g. row 0 column 9 is the distance between observation 0 and observation 9.
import numpy as np
#import cupy as np

def l1_distance(arr):
    return np.linalg.norm(arr, 1)

X = np.random.randint(low=0, high=255, size=(700, 4096))
distance = np.empty((700, 700))

for i in range(700):
    for j in range(700):
        distance[i, j] = l1_distance(X[i, :] - X[j, :])
I attempted this on GPU using cupy by uncommenting the second import statement, but obviously the double for loop is drastically inefficient. NumPy takes approx 6 seconds, but cupy takes 26 seconds. I understand why, but it's not immediately clear to me how to parallelize this process.
I know I'm going to need to write a reduction kernel of some sort, but I can't think of how to construct one cupy array from iterative operations on elements of another array.
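For reference, the per-pair quantity the double loop computes is just the L1 norm of the difference vector, i.e. the sum of absolute differences. A minimal check that `np.linalg.norm(v, 1)` on a 1-D vector is equivalent to `np.abs(v).sum()`:

```python
import numpy as np

a = np.array([1.0, 4.0, 2.0])
b = np.array([3.0, 1.0, 2.0])

# For a 1-D array, np.linalg.norm(v, 1) is the L1 norm: sum of |v_i|
d = np.linalg.norm(a - b, 1)

assert d == np.abs(a - b).sum()  # |1-3| + |4-1| + |2-2| = 5.0
```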
Answer 1:
Using broadcasting, CuPy takes 0.10 seconds on an A100 GPU, compared to 6.6 seconds for NumPy:
for i in range(700):
    distance[i, :] = np.abs(np.broadcast_to(X[i, :], X.shape) - X).sum(axis=1)
This vectorizes the inner loop: each iteration computes the distances from one observation to all others in parallel.
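A small sketch checking the broadcast version against the naive double loop, using reduced sizes so it runs instantly. It also shows a fully vectorized variant that removes the outer loop as well; note that this builds an (m, m, n) intermediate array, which for the original 700×4096 problem would need tens of gigabytes, so the row-wise loop above is the practical choice at that scale:

```python
import numpy as np

# Small sizes so the check is cheap; with cupy imported as np,
# the same code runs on the GPU.
m, n = 8, 16
X = np.random.randint(0, 255, size=(m, n)).astype(np.float64)

# Naive double loop (reference)
ref = np.empty((m, m))
for i in range(m):
    for j in range(m):
        ref[i, j] = np.abs(X[i] - X[j]).sum()

# Row-wise broadcasting, as in the answer
dist = np.empty((m, m))
for i in range(m):
    dist[i, :] = np.abs(np.broadcast_to(X[i, :], X.shape) - X).sum(axis=1)

# Fully vectorized variant: broadcasts to an (m, m, n) intermediate,
# so it only fits in memory for modest m and n.
full = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)

assert np.allclose(ref, dist) and np.allclose(ref, full)
```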
Source: https://stackoverflow.com/questions/64476655/using-cupy-to-create-a-distance-matrix-from-another-matrix-on-gpu