Beginner with Python here. I'm having trouble calculating the binary pairwise Hamming distance matrix between the rows of an input matrix using only th
Try this approach: insert a new axis at position 1 (axis=1), then let broadcasting compare every pair of rows and count the True (non-zero) values with sum:
(arr[:, None, :] != arr).sum(2)
# array([[0, 2, 3],
# [2, 0, 3],
# [3, 3, 0]])
def compute_HammingDistance(X):
    # Compare every row with every other row, then count mismatches per pair.
    return (X[:, None, :] != X).sum(2)
Explanation:
1) Create a 3D array with shape (3, 1, 6):
arr[:, None, :]
#array([[[1, 0, 0, 1, 1, 0]],
# [[1, 0, 0, 0, 0, 0]],
# [[1, 1, 1, 1, 0, 0]]])
2) The original 2D array has shape (3, 6):
arr
#array([[1, 0, 0, 1, 1, 0],
# [1, 0, 0, 0, 0, 0],
# [1, 1, 1, 1, 0, 0]])
3) Comparing the two triggers broadcasting because their shapes don't match. The 2D array arr, of shape (3, 6), is treated as shape (1, 3, 6) and stretched along axis 0 of the 3D array arr[:, None, :], while each (1, 6) slice of the 3D array is stretched along axis 1 to match (3, 6). Together, the two broadcasting steps compare every row of the original array with every other row (a Cartesian comparison):
arr[:, None, :] != arr
#array([[[False, False, False, False, False, False],
# [False, False, False, True, True, False],
# [False, True, True, False, True, False]],
# [[False, False, False, True, True, False],
# [False, False, False, False, False, False],
# [False, True, True, True, False, False]],
# [[False, True, True, False, True, False],
# [False, True, True, True, False, False],
# [False, False, False, False, False, False]]], dtype=bool)
4) The sum along the third axis counts how many elements are not equal, i.e., the number of True values, which gives the Hamming distance.
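The four steps above can be reproduced end to end with the example matrix from this answer. This is only a sketch: the intermediate names (expanded, mismatch, dist) and the loop-based helper hamming_loops are made up here as a cross-check and are not part of the original answer.

```python
import numpy as np

# Example matrix used in the explanation above.
arr = np.array([[1, 0, 0, 1, 1, 0],
                [1, 0, 0, 0, 0, 0],
                [1, 1, 1, 1, 0, 0]])

expanded = arr[:, None, :]      # step 1: shape (3, 1, 6)
mismatch = expanded != arr      # step 3: broadcast to shape (3, 3, 6)
dist = mismatch.sum(axis=2)     # step 4: shape (3, 3), mismatches per row pair


def hamming_loops(X):
    """Hypothetical reference implementation using explicit loops."""
    m = X.shape[0]
    out = np.zeros((m, m), dtype=int)
    for i in range(m):
        for j in range(m):
            out[i, j] = np.count_nonzero(X[i] != X[j])
    return out


print(expanded.shape, mismatch.shape, dist.shape)
print(dist)
print(np.array_equal(dist, hamming_loops(arr)))
```

Note that the intermediate boolean array has one entry per (row, row, column) triple, so its size grows with the square of the number of rows.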
For reasons I do not understand, this
(2 * np.inner(a-0.5, 0.5-a) + a.shape[1] / 2)
appears to be much faster than @Psidom's for larger arrays:
import numpy as np
from timeit import timeit
a = np.random.randint(0,2,(100,1000))
timeit(lambda: (a[:, None, :] != a).sum(2), number=100)
# 2.297890231013298
timeit(lambda: (2 * np.inner(a-0.5, 0.5-a) + a.shape[1] / 2), number=100)
# 0.10616962902713567
Psidom's is a bit faster for the very small example:
a
# array([[1, 0, 0, 1, 1, 0],
# [1, 0, 0, 0, 0, 0],
# [1, 1, 1, 1, 0, 0]])
timeit(lambda: (a[:, None, :] != a).sum(2), number=100)
# 0.0004370050155557692
timeit(lambda: (2 * np.inner(a-0.5, 0.5-a) + a.shape[1] / 2), number=100)
# 0.00068191799800843
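Why the inner-product expression gives the Hamming distance at all may not be obvious, so here is a short check. For 0/1 entries, each product (a_i - 0.5) * (0.5 - a_j) is +0.25 where the two entries differ and -0.25 where they agree. Summed over n columns with d mismatches, that is (2d - n) / 4; multiplying by 2 and adding n / 2 leaves d. The sketch below compares both expressions on a random array; the array size and seed are arbitrary choices, not values from the answer.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed for repeatability
a = rng.integers(0, 2, size=(20, 50))     # random 0/1 matrix

by_broadcast = (a[:, None, :] != a).sum(2)
by_inner = 2 * np.inner(a - 0.5, 0.5 - a) + a.shape[1] / 2

print(by_broadcast.dtype, by_inner.dtype)
print(np.array_equal(by_broadcast, by_inner))
```

The two results hold the same values, but the inner-product version returns floats, so cast it back to an integer dtype if integer distances are needed.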
Update
Part of the reason appears to be that floats are faster than other dtypes here:
timeit(lambda: (0.5 * np.inner(2*a-1, 1-2*a) + a.shape[1] / 2), number=100)
# 0.7315902590053156
timeit(lambda: (0.5 * np.inner(2.0*a-1, 1-2.0*a) + a.shape[1] / 2), number=100)
# 0.12021801102673635
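One plausible explanation, stated here as an assumption rather than something measured in this thread, is that NumPy passes floating-point inner products to an optimized BLAS routine, while integer inputs take a slower generic path. If that holds, casting to float explicitly and rounding back to integers keeps the fast path regardless of the input dtype. The helper name hamming_inner below is hypothetical:

```python
import numpy as np


def hamming_inner(X):
    """Pairwise Hamming distances between the rows of a 0/1 matrix.

    Hypothetical helper: casts to float64 before np.inner, then rounds
    the float result back to integers.
    """
    Xf = np.asarray(X, dtype=np.float64)
    n = Xf.shape[1]
    d = 2 * np.inner(Xf - 0.5, 0.5 - Xf) + n / 2
    return np.rint(d).astype(np.intp)


arr = np.array([[1, 0, 0, 1, 1, 0],
                [1, 0, 0, 0, 0, 0],
                [1, 1, 1, 1, 0, 0]])
print(hamming_inner(arr))
```

This variant also avoids the three-dimensional boolean temporary that the broadcasting version builds, which may matter for matrices with many rows.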