A few years ago, someone posted on Active State Recipes for comparison purposes, three python/NumPy functions; each of these accepted the same arguments and returne
TL; DR The second code above is only looping over the number of dimensions of the points (3 times through the for loop for 3D points) so the looping isn't much there. The real speed-up in the second code above is that it better harnesses the power of Numpy to avoid creating some extra matrices when finding the differences between points. This reduces memory used and computational effort.
Longer Explanation
I think that the calcDistanceMatrixFastEuclidean2
function is deceiving you with its loop perhaps. It is only looping over the number of dimensions of the points. For 1D points, the loop only executes once, for 2D, twice, and for 3D, thrice. This is really not much looping at all.
Let's analyze the code a little bit to see why the one is faster than the other. calcDistanceMatrixFastEuclidean
I will call fast1
and calcDistanceMatrixFastEuclidean2
will be fast2
.
fast1
is based on the Matlab way of doing things as is evidenced by the repmap
function. The repmap
function creates an array in this case that is just the original data repeated over and over again. However, if you look at the code for the function, it is very inefficient. It uses many Numpy functions (3 reshape
s and 2 repeat
s) to do this. The repeat
function is also used to create an array that contains the the original data with each data item repeated many times. If our input data is [1,2,3]
then we are subtracting [1,2,3,1,2,3,1,2,3]
from [1,1,1,2,2,2,3,3,3]
. Numpy has had to create a lot of extra matrices in between running Numpy's C code which could have been avoided.
fast2
uses more of Numpy's heavy lifting without creating as many matrices between Numpy calls. fast2
loops through each dimension of the points, does the subtraction and keeps a running total of the squared differences between each dimension. Only at the end is the square root done. So far, this may not sound quite as efficient as fast1
, but fast2
avoids doing the repmat
stuff by using Numpy's indexing. Let's look at the 1D case for simplicity. fast2
makes a 1D array of the data and subtracts it from a 2D (N x 1) array of the data. This creates the difference matrix between each point and all of the other points without having to use repmat
and repeat
and thereby bypasses creating a lot of extra arrays. This is where the real speed difference lies in my opinion. fast1
creates a lot of extra in between matrices (and they are created expensively computationally) to find the differences between points while fast2
better harnesses the power of Numpy to avoid these.
By the way, here is a little bit faster version of fast2
:
def calcDistanceMatrixFastEuclidean3(nDimPoints):
nDimPoints = array(nDimPoints)
n,m = nDimPoints.shape
data = nDimPoints[:,0]
delta = (data - data[:,newaxis])**2
for d in xrange(1,m):
data = nDimPoints[:,d]
delta += (data - data[:,newaxis])**2
return sqrt(delta)
The difference is that we are no longer creating delta as a zeros matrix.
dis
for fun:
dis.dis(calcDistanceMatrixFastEuclidean)
2 0 LOAD_GLOBAL 0 (len)
3 LOAD_FAST 0 (points)
6 CALL_FUNCTION 1
9 STORE_FAST 1 (numPoints)
3 12 LOAD_GLOBAL 1 (sqrt)
15 LOAD_GLOBAL 2 (sum)
18 LOAD_GLOBAL 3 (repmat)
21 LOAD_FAST 0 (points)
24 LOAD_FAST 1 (numPoints)
27 LOAD_CONST 1 (1)
30 CALL_FUNCTION 3
4 33 LOAD_GLOBAL 4 (repeat)
36 LOAD_FAST 0 (points)
39 LOAD_FAST 1 (numPoints)
42 LOAD_CONST 2 ('axis')
45 LOAD_CONST 3 (0)
48 CALL_FUNCTION 258
51 BINARY_SUBTRACT
52 LOAD_CONST 4 (2)
55 BINARY_POWER
56 LOAD_CONST 2 ('axis')
59 LOAD_CONST 1 (1)
62 CALL_FUNCTION 257
65 CALL_FUNCTION 1
68 STORE_FAST 2 (distMat)
5 71 LOAD_FAST 2 (distMat)
74 LOAD_ATTR 5 (reshape)
77 LOAD_FAST 1 (numPoints)
80 LOAD_FAST 1 (numPoints)
83 BUILD_TUPLE 2
86 CALL_FUNCTION 1
89 RETURN_VALUE
dis.dis(calcDistanceMatrixFastEuclidean2)
2 0 LOAD_GLOBAL 0 (array)
3 LOAD_FAST 0 (nDimPoints)
6 CALL_FUNCTION 1
9 STORE_FAST 0 (nDimPoints)
3 12 LOAD_FAST 0 (nDimPoints)
15 LOAD_ATTR 1 (shape)
18 UNPACK_SEQUENCE 2
21 STORE_FAST 1 (n)
24 STORE_FAST 2 (m)
4 27 LOAD_GLOBAL 2 (zeros)
30 LOAD_FAST 1 (n)
33 LOAD_FAST 1 (n)
36 BUILD_TUPLE 2
39 LOAD_CONST 1 ('d')
42 CALL_FUNCTION 2
45 STORE_FAST 3 (delta)
5 48 SETUP_LOOP 76 (to 127)
51 LOAD_GLOBAL 3 (xrange)
54 LOAD_FAST 2 (m)
57 CALL_FUNCTION 1
60 GET_ITER
>> 61 FOR_ITER 62 (to 126)
64 STORE_FAST 4 (d)
6 67 LOAD_FAST 0 (nDimPoints)
70 LOAD_CONST 0 (None)
73 LOAD_CONST 0 (None)
76 BUILD_SLICE 2
79 LOAD_FAST 4 (d)
82 BUILD_TUPLE 2
85 BINARY_SUBSCR
86 STORE_FAST 5 (data)
7 89 LOAD_FAST 3 (delta)
92 LOAD_FAST 5 (data)
95 LOAD_FAST 5 (data)
98 LOAD_CONST 0 (None)
101 LOAD_CONST 0 (None)
104 BUILD_SLICE 2
107 LOAD_GLOBAL 4 (newaxis)
110 BUILD_TUPLE 2
113 BINARY_SUBSCR
114 BINARY_SUBTRACT
115 LOAD_CONST 2 (2)
118 BINARY_POWER
119 INPLACE_ADD
120 STORE_FAST 3 (delta)
123 JUMP_ABSOLUTE 61
>> 126 POP_BLOCK
8 >> 127 LOAD_GLOBAL 5 (sqrt)
130 LOAD_FAST 3 (delta)
133 CALL_FUNCTION 1
136 RETURN_VALUE
I'm not an expert on dis
, but it seems like you would have to look more at the functions that the first is calling to know why they take a while. There is a performance profiler tool with Python as well, cProfile
.