问题
The following codes iterates over each element of two array to compute pairwise euclidean distance.
def compute_distances_two_loops(X, Y):
num_test = X.shape[0]
num_train = Y.shape[0]
dists = np.zeros((num_test, num_train))
for i in range(num_test):
for j in range(num_train):
dists[i][j] = np.sqrt(np.sum((X[i] - Y[j])**2))
return dists
The following code serves the same purpose but with single loop.
def compute_distances_one_loop(X, Y):
num_test = X.shape[0]
num_train = Y.shape[0]
dists = np.zeros((num_test, num_train))
for i in range(num_test):
dists[i, :] = np.sqrt(np.sum((Y - X[i])**2, axis=1))
return dists
Below are time comparison for both.
two_loop_time = time_function(compute_distances_two_loops, X, Y)
print ('Two loop version took %f seconds' % two_loop_time)
>> Two loop version took 20.3 seconds
one_loop_time = time_function(compute_distances_one_loop, X, Y)
print ('One loop version took %f seconds' % one_loop_time)
>> One loop version took 80.9 seconds
Both X and Y are numpy.ndarray with shape -
X: (500, 3000) Y: (5000, 3000)
Out of intuition the results are not correct, the single loop should run at least with same speed. What am I missing here ?
PS: The result is not from a single run. I ran the code number of times, on different machines, the results are similar.
回答1:
The reason is size of arrays within the loop body. In the two loop variant works on two arrays of 3000 elements. This easily fits into at least the level 2 cache of a cpu which is much faster than the main memory but it is also large enough that computing the distance is much slower than the python loop iteration overhead.
The second case the loop body works on 5000 * 3000 elements. This is so much that the data needs go to main memory in each computation step (first the Y-X[i] subtraction into a temporary array, squaring the temporary into another temporary and then read it back to sum it). The main memory is much slower than the cpu for the simple operations involved so it takes much longer despite removing a loop. You could speed it up a bit by using inplace operations writing into preallocated temporary array, but it will still be slower than the two loop variant for these array sizes.
Note that scipy.spatial.distance.cdist(X, Y)
is probably going to be the fastest, as it does not need any temporaries at all
来源:https://stackoverflow.com/questions/39502630/numpy-single-loop-vectorized-code-slow-compared-to-two-loop-iteration