Numpy: Single loop vectorized code slow compared to two loop iteration

Posted by 魔方 西西 on 2020-01-06 21:07:21

Question


The following code iterates over each pair of rows from two arrays to compute the pairwise Euclidean distances.

import numpy as np

def compute_distances_two_loops(X, Y):
    num_test = X.shape[0]
    num_train = Y.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            # Euclidean distance between row i of X and row j of Y.
            dists[i, j] = np.sqrt(np.sum((X[i] - Y[j])**2))
    return dists

The following code serves the same purpose but uses a single loop.

def compute_distances_one_loop(X, Y):
    num_test = X.shape[0]
    num_train = Y.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        # Broadcast X[i] against every row of Y, then sum over the feature axis.
        dists[i, :] = np.sqrt(np.sum((Y - X[i])**2, axis=1))
    return dists

Below is the timing comparison for both.

two_loop_time = time_function(compute_distances_two_loops, X, Y)
print('Two loop version took %f seconds' % two_loop_time)

>> Two loop version took 20.3 seconds

one_loop_time = time_function(compute_distances_one_loop, X, Y)
print('One loop version took %f seconds' % one_loop_time)

>> One loop version took 80.9 seconds

Both X and Y are numpy.ndarray objects with the following shapes:

X: (500, 3000) Y: (5000, 3000)

Intuitively these results look wrong: the single-loop version should run at least as fast. What am I missing here?

PS: The result is not from a single run. I ran the code a number of times on different machines, and the results are similar.


Answer 1:


The reason is the size of the arrays in the loop body. The two-loop variant works on two arrays of 3000 elements each. These easily fit into at least the level 2 cache of a CPU, which is much faster than main memory, yet they are large enough that computing the distance takes much longer than the Python loop iteration overhead.

In the second case the loop body works on 5000 * 3000 elements. This is so much data that it has to go to main memory in each computation step (first the Y - X[i] subtraction into a temporary array, then squaring that temporary into another temporary, and then reading it back to sum it). Main memory is much slower than the CPU for the simple operations involved, so it takes much longer despite removing a loop. You could speed it up a bit by using in-place operations that write into a preallocated temporary array, but it will still be slower than the two-loop variant for these array sizes.
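As a rough sketch of the in-place idea mentioned above, the one-loop version could reuse a single preallocated buffer through the out= argument of the NumPy ufuncs. The function name compute_distances_one_loop_inplace and the buffer name tmp are made up for this illustration, and no timing is claimed for it.

```python
import numpy as np

def compute_distances_one_loop_inplace(X, Y):
    # Hypothetical variant of compute_distances_one_loop that avoids
    # allocating new temporaries on every iteration.
    num_test = X.shape[0]
    num_train = Y.shape[0]
    dists = np.zeros((num_test, num_train))
    # One scratch buffer with the same shape as Y, allocated once.
    tmp = np.empty_like(Y, dtype=np.float64)
    for i in range(num_test):
        np.subtract(Y, X[i], out=tmp)       # tmp = Y - X[i]
        np.square(tmp, out=tmp)             # tmp = tmp**2, in place
        np.sum(tmp, axis=1, out=dists[i])   # row sums written straight into dists[i]
        np.sqrt(dists[i], out=dists[i])     # square root in place
    return dists
```

The buffer still holds 5000 * 3000 values for the shapes in the question, so each step keeps streaming through main memory. Only the repeated allocations go away.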

Note that scipy.spatial.distance.cdist(X, Y) is probably going to be the fastest, as it does not need any temporaries at all.
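A minimal usage sketch, assuming SciPy is installed. The small random arrays below are placeholders rather than the data from the question. cdist uses the Euclidean metric by default.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Placeholder data; the question used shapes (500, 3000) and (5000, 3000).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 30))
Y = rng.standard_normal((50, 30))

# dists[i, j] is the Euclidean distance between X[i] and Y[j].
dists = cdist(X, Y)
```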



Source: https://stackoverflow.com/questions/39502630/numpy-single-loop-vectorized-code-slow-compared-to-two-loop-iteration
