Calculate Distances Between One Point in Matrix From All Other Points

后端 未结 4 2237
借酒劲吻你
借酒劲吻你 2021-02-15 16:50

I am new to Python and I need to implement a clustering algorithm. For that, I will need to calculate distances between the given input data.

Consider the following inpu

相关标签:
4条回答
  • 2021-02-15 16:54

    Here's one approach using SciPy's cdist -

    from scipy.spatial.distance import cdist
    def closest_rows(a):
        # Get euclidean distances as 2D array
        dists = cdist(a, a, 'sqeuclidean')
    
        # Fill diagonals with something greater than all elements as we intend
        # to get argmin indices later on and then index into input array with those
        # indices to get the closest rows
        dists.ravel()[::dists.shape[1]+1] = dists.max()+1
        return a[dists.argmin(1)]
    

    Sample run -

    In [72]: a
    Out[72]: 
    array([[1, 2, 8],
           [7, 4, 2],
           [9, 1, 7],
           [0, 1, 5],
           [6, 4, 3]])
    
    In [73]: closest_rows(a)
    Out[73]: 
    array([[0, 1, 5],
           [6, 4, 3],
           [6, 4, 3],
           [1, 2, 8],
           [7, 4, 2]])
    

    Runtime test

    Other working approach(es) -

    def norm_app(a): # @Psidom's soln
        dist = np.linalg.norm(a - a[:,None], axis=-1); 
        dist[np.arange(dist.shape[0]), np.arange(dist.shape[0])] = np.nan
        return a[np.nanargmin(dist, axis=0)]
    

    Timings with 10,000 points -

    In [79]: a = np.random.randint(0,9,(10000,3))
    
    In [80]: %timeit norm_app(a) # @Psidom's soln
    1 loop, best of 3: 3.83 s per loop
    
    In [81]: %timeit closest_rows(a)
    1 loop, best of 3: 392 ms per loop
    

    Further performance boost

    There's eucl_dist package (disclaimer: I am its author) that contains various methods to compute euclidean distances that are much more efficient than SciPy's cdist, especially for large arrays.

    Thus, making use of it, we would have a more performant one, like so -

    from eucl_dist.cpu_dist import dist
    def closest_rows_v2(a):
        dists = dist(a,a, matmul="gemm", method="ext") 
        dists.ravel()[::dists.shape[1]+1] = dists.max()+1
        return a[dists.argmin(1)]
    

    Timings -

    In [162]: a = np.random.randint(0,9,(10000,3))
    
    In [163]: %timeit closest_rows(a)
    1 loop, best of 3: 394 ms per loop
    
    In [164]: %timeit closest_rows_v2(a)
    1 loop, best of 3: 229 ms per loop
    
    0 讨论(0)
  • 2021-02-15 17:05

    From this thread's you can use the e_dist function there and also obtain the same results.

    Addendum

    Timing: on my memory starved laptop, I can only do a comparison to a smaller sample than @Psidom 's using his norm_app function.

    a = np.random.randint(0,9,(5000,3))

    %timeit norm_app(a) 1.91 s ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %timeit e_dist(a, a) 631 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    a 
    array([[1, 2, 8],
           [7, 4, 2],
           [9, 1, 7],
           [0, 1, 5],
           [6, 4, 3]])
    
    dm = e_dist(a, a)  # get the def from the link
    
    dm
    Out[7]: 
    array([[ 0.  ,  8.72,  8.12,  3.32,  7.35],
           [ 8.72,  0.  ,  6.16,  8.19,  1.41],
           [ 8.12,  6.16,  0.  ,  9.22,  5.83],
           [ 3.32,  8.19,  9.22,  0.  ,  7.  ],
           [ 7.35,  1.41,  5.83,  7.  ,  0.  ]])
    
    idx = np.argsort(dm)
    
    closest = a[idx]
    
    closest
    Out[10]: 
    array([[[1, 2, 8],
            [0, 1, 5],
            [6, 4, 3],
            [9, 1, 7],
            [7, 4, 2]],
    
           [[7, 4, 2],
            [6, 4, 3],
            [9, 1, 7],
            [0, 1, 5],
            [1, 2, 8]],
    
           [[9, 1, 7],
            [6, 4, 3],
            [7, 4, 2],
            [1, 2, 8],
            [0, 1, 5]],
    
           [[0, 1, 5],
            [1, 2, 8],
            [6, 4, 3],
            [7, 4, 2],
            [9, 1, 7]],
    
           [[6, 4, 3],
            [7, 4, 2],
            [9, 1, 7],
            [0, 1, 5],
            [1, 2, 8]]])
    
    0 讨论(0)
  • 2021-02-15 17:06

    I suggest using pdist and squareform from scipy.spatial.distance

    Consider the following array of points:

    a = np.array([[1,2,8], [7,4,2], [9,1,7], [0,1,5], [6,4,3]])
    

    If you want to display all distances between point [1,2,8] and the other points:

    squareform(pdist(a))
    
    Out[1]: array([[ 0.        ,  8.71779789,  8.1240384 ,  3.31662479,  7.34846923],
                   [ 8.71779789,  0.        ,  6.164414  ,  8.18535277,  1.41421356],
                   [ 8.1240384 ,  6.164414  ,  0.        ,  9.21954446,  5.83095189],
                   [ 3.31662479,  8.18535277,  9.21954446,  0.        ,  7.        ],
                   [ 7.34846923,  1.41421356,  5.83095189,  7.        ,  0.        ]])
    

    I you want to display the shortest distance between point [1,2,8] and the closest point:

    sorted(squareform(pdist(a))[0])[1]
    
    Out[2]: 3.3166247903553998
    

    [0] being the index of your first point ([1,2,8])

    [1] being the index of the second minimum value (to avoid zeros)

    If you want to display the index of the closest point to [1,2,8]:

    np.argsort(squareform(pdist(a))[0])[1]
    
    Out[3]: 3
    
    0 讨论(0)
  • 2021-02-15 17:15

    Use np.linalg.norm combined with broadcasting (numpy outer subtraction), you can do:

    np.linalg.norm(a - a[:,None], axis=-1)
    

    a[:,None] insert a new axis into a, a - a[:,None] will then do a row by row subtraction due to broadcasting. np.linalg.norm calculates the np.sqrt(np.sum(np.square(...))) over the last axis:


    a = np.array([[1,2,8],
         [7,4,2],
         [9,1,7],
         [0,1,5],
         [6,4,3]])
    
    np.linalg.norm(a - a[:,None], axis=-1)
    #array([[ 0.        ,  8.71779789,  8.1240384 ,  3.31662479,  7.34846923],
    #       [ 8.71779789,  0.        ,  6.164414  ,  8.18535277,  1.41421356],
    #       [ 8.1240384 ,  6.164414  ,  0.        ,  9.21954446,  5.83095189],
    #       [ 3.31662479,  8.18535277,  9.21954446,  0.        ,  7.        ],
    #       [ 7.34846923,  1.41421356,  5.83095189,  7.        ,  0.        ]])
    

    The elements [0,1], [0,2] for instance correspond to:

    np.sqrt(np.sum((a[0] - a[1]) ** 2))
    # 8.717797887081348
    
    np.sqrt(np.sum((a[0] - a[2]) ** 2))
    # 8.1240384046359608
    

    respectively.

    0 讨论(0)
提交回复
热议问题