Python - How to generate the Pairwise Hamming Distance Matrix

既然无缘 2021-01-23 06:19

Beginner with Python here. I'm having trouble calculating the binary pairwise Hamming distance matrix between the rows of an input matrix using only th

2 answers
  • 2021-01-23 06:47

    Try this approach: insert a new axis at position 1, let broadcasting compare every row with every other row, then count the True (non-zero) entries with sum:

    import numpy as np

    arr = np.array([[1, 0, 0, 1, 1, 0],
                    [1, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 0, 0]])

    (arr[:, None, :] != arr).sum(2)
    # array([[0, 2, 3],
    #        [2, 0, 3],
    #        [3, 3, 0]])
    

    def compute_HammingDistance(X):
        return (X[:, None, :] != X).sum(2)
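
One way to sanity-check the vectorized function is to compare it with an explicit double loop over row pairs. This is only a sketch: `hamming_loop` is a hypothetical helper introduced here for illustration, and `X` is the sample array used in this answer.

```python
import numpy as np

def compute_HammingDistance(X):
    # Vectorized version from the answer above.
    return (X[:, None, :] != X).sum(2)

def hamming_loop(X):
    """Hypothetical reference helper: compare every pair of rows explicitly."""
    n = X.shape[0]
    out = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            # Number of positions where row i and row j differ.
            out[i, j] = np.count_nonzero(X[i] != X[j])
    return out

X = np.array([[1, 0, 0, 1, 1, 0],
              [1, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0]])

print(np.array_equal(compute_HammingDistance(X), hamming_loop(X)))
```

The loop is much slower on large inputs, so it is only useful as a correctness check.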
    

    Explanation:

    1) Create a 3D array of shape (3, 1, 6):

    arr[:, None, :]
    #array([[[1, 0, 0, 1, 1, 0]],
    #       [[1, 0, 0, 0, 0, 0]],
    #       [[1, 1, 1, 1, 0, 0]]])
    

    2) The original array is 2D, with shape (3, 6):

    arr   
    #array([[1, 0, 0, 1, 1, 0],
    #       [1, 0, 0, 0, 0, 0],
    #       [1, 1, 1, 1, 0, 0]])
    

    3) Because the shapes differ, the comparison triggers broadcasting. The 2D array arr of shape (3, 6) is treated as (1, 3, 6) and stretched along axis 0, while arr[:, None, :] of shape (3, 1, 6) is stretched along axis 1. The result has shape (3, 3, 6), and entry [i, j, k] compares arr[i, k] with arr[j, k], so every row is compared against every other row.

    arr[:, None, :] != arr 
    #array([[[False, False, False, False, False, False],
    #        [False, False, False,  True,  True, False],
    #        [False,  True,  True, False,  True, False]],
    #       [[False, False, False,  True,  True, False],
    #        [False, False, False, False, False, False],
    #        [False,  True,  True,  True, False, False]],
    #       [[False,  True,  True, False,  True, False],
    #        [False,  True,  True,  True, False, False],
    #        [False, False, False, False, False, False]]], dtype=bool)
    

    4) Summing along the third axis (axis=2) counts the True entries, i.e. the positions where two rows differ, which is the Hamming distance.
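
The four steps can be traced by printing the shape at each stage. This is a minimal sketch using the same sample array; the intermediate names `left`, `diff`, and `dist` are only illustrative.

```python
import numpy as np

arr = np.array([[1, 0, 0, 1, 1, 0],
                [1, 0, 0, 0, 0, 0],
                [1, 1, 1, 1, 0, 0]])

left = arr[:, None, :]    # step 1: new axis in the middle
diff = left != arr        # step 3: broadcast comparison of all row pairs
dist = diff.sum(axis=2)   # step 4: count mismatches per pair

print(left.shape)   # (3, 1, 6)
print(arr.shape)    # (3, 6)
print(diff.shape)   # (3, 3, 6)
print(dist.shape)   # (3, 3)
```

Note that `diff` holds one boolean per (row, row, column) triple, so its memory use grows with the square of the number of rows.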

  • 2021-01-23 07:00

    For reasons I do not understand, this

    (2 * np.inner(a-0.5, 0.5-a) + a.shape[1] / 2)
    

    appears to be much faster than @Psidom's for larger arrays:

    import numpy as np
    from timeit import timeit

    a = np.random.randint(0, 2, (100, 1000))
    timeit(lambda: (a[:, None, :] != a).sum(2), number=100)
    # 2.297890231013298
    timeit(lambda: (2 * np.inner(a-0.5, 0.5-a) + a.shape[1] / 2), number=100)
    # 0.10616962902713567
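
As for why the inner-product expression gives the Hamming distance at all, here is a short argument, assuming every entry is exactly 0 or 1. Each term (x - 0.5) * (0.5 - y) equals -0.25 when x == y and +0.25 when x != y. With n columns and d mismatches, the inner product is therefore (2d - n) / 4; doubling it and adding n / 2 leaves d. A quick numerical check on the small sample array (the names `by_inner` and `by_compare` are just illustrative):

```python
import numpy as np

a = np.array([[1, 0, 0, 1, 1, 0],
              [1, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0]])

n = a.shape[1]  # number of columns

# Inner-product form: a float array holding whole-number values.
by_inner = 2 * np.inner(a - 0.5, 0.5 - a) + n / 2

# Broadcasting form: an integer array.
by_compare = (a[:, None, :] != a).sum(2)

print(np.array_equal(by_inner, by_compare))
```

The identity does not hold for inputs with values other than 0 and 1, whereas the `!=` version works for any values.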
    

    Psidom's is a bit faster for the very small example:

    a
    # array([[1, 0, 0, 1, 1, 0],
    #        [1, 0, 0, 0, 0, 0],
    #        [1, 1, 1, 1, 0, 0]])
    
    timeit(lambda: (a[:, None, :] != a).sum(2), number=100)
    # 0.0004370050155557692
    timeit(lambda: (2 * np.inner(a-0.5, 0.5-a) + a.shape[1] / 2), number=100)
    # 0.00068191799800843
    

    Update

    Part of the reason appears to be floats being faster than other dtypes:

    timeit(lambda: (0.5 * np.inner(2*a-1, 1-2*a) + a.shape[1] / 2), number=100)
    # 0.7315902590053156
    timeit(lambda: (0.5 * np.inner(2.0*a-1, 1-2.0*a) + a.shape[1] / 2), number=100)
    # 0.12021801102673635
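
A likely explanation for the dtype effect is that NumPy hands floating-point matrix products to an optimized BLAS routine, while integer products go through a slower generic loop. Under that assumption, one option is to cast to float explicitly and round the result back to integers. The function name `hamming_matrix_inner` is hypothetical, and the sketch assumes a 2D input containing only 0s and 1s.

```python
import numpy as np

def hamming_matrix_inner(a):
    """Pairwise Hamming distances between rows of a 0/1 array (sketch)."""
    # Cast once so the inner product runs on floats.
    f = np.asarray(a, dtype=np.float64)
    n_cols = f.shape[1]
    d = 2 * np.inner(f - 0.5, 0.5 - f) + n_cols / 2
    # The values are whole numbers stored as floats; convert to integers.
    return np.rint(d).astype(np.int64)

rng = np.random.default_rng(0)
sample = rng.integers(0, 2, size=(20, 50))
print(np.array_equal(hamming_matrix_inner(sample),
                     (sample[:, None, :] != sample).sum(2)))
```

Actual timings depend on the NumPy build and the BLAS library it links against, so it is worth measuring on the target machine.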
    