Condensed matrix function to find pairs

忘了有多久 2020-12-24 05:18

For a set of observations:

[a1,a2,a3,a4,a5]

their pairwise distances

d=[[0,a12,a13,a14,a15]
   [a21,0,a23,a24,a25]
   [a31,         


        
7 Answers
  • 2020-12-24 05:32

    The formula for an index of the condensed matrix is

    index = d*(d-1)/2 - (d-i)*(d-i-1)/2 + j - i - 1
    

    Where i is the row index, j is the column index, and d is the row length of the original (d X d) upper triangular matrix.
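As a quick sanity check of the formula above, the sketch below compares it against a brute-force enumeration of the upper triangle. The helper name `condensed_index` is introduced here for illustration only and is not part of the original answer; indices are assumed to be 0-based with i < j.

```python
from itertools import combinations

def condensed_index(d, i, j):
    """Condensed index of entry (i, j), i < j, in a d x d distance matrix."""
    # Each product is of two consecutive integers, so // 2 is exact.
    return d*(d - 1)//2 - (d - i)*(d - i - 1)//2 + j - i - 1

# combinations(range(d), 2) yields (0, 1), (0, 2), ..., (d-2, d-1),
# which is the row-major order of the upper triangle.
d = 5
pairs = list(combinations(range(d), 2))
indices = [condensed_index(d, i, j) for i, j in pairs]
print(indices)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The formula reproduces the positions 0 through 9 in order, which is the layout the rest of this answer inverts.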

    Consider the case when the index refers to the leftmost, non-zero entry of some row in the original matrix. For all the leftmost indices,

    j == i + 1
    

    so

    index = d*(d-1)/2 - (d-i)*(d-i-1)/2 + i + 1 - i - 1
    index = d*(d-1)/2 - (d-i)*(d-i-1)/2
    

    With some algebra, we can rewrite this as

    i**2 + (1 - 2*d)*i + 2*index == 0
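The algebra skipped above can be written out as follows, using the same symbols (i, d, index) as the formulas in this answer:

```latex
\begin{aligned}
2\,\mathrm{index} &= d(d-1) - (d-i)(d-i-1) \\
                  &= (d^2 - d) - (d^2 - 2di + i^2 - d + i) \\
                  &= 2di - i^2 - i \\
\Longrightarrow\quad 0 &= i^2 + (1 - 2d)\,i + 2\,\mathrm{index}
\end{aligned}
```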
    

    Then we can use the quadratic formula to find the roots of this equation. Both roots are non-negative, and we only care about the smaller one (the larger root is always greater than d - 1, so it cannot be a row index).

    If the index does correspond to the leftmost non-zero cell of a row, the smaller root is a non-negative integer equal to the row number. Then finding the column number is just arithmetic.

    j = index - d*(d-1)/2 + (d-i)*(d-i-1)/2 + i + 1
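A small worked example may make the root selection concrete. The sketch below computes both roots of the quadratic for d = 5; the helper name `quadratic_roots` is hypothetical. It shows that the smaller root is the one that carries the row number, while the larger root falls outside the valid row range.

```python
import math

def quadratic_roots(d, index):
    """Both roots of i**2 + (1 - 2*d)*i + 2*index == 0, smaller first."""
    b = 1 - 2*d
    disc = math.sqrt(b**2 - 8*index)
    return ((-b - disc)/2, (-b + disc)/2)

d = 5
# Index 4 is the leftmost non-zero cell of row 1, so the smaller root is an integer.
print(quadratic_roots(d, 4))   # (1.0, 8.0)

# Index 5 is the next cell in row 1; the smaller root lies between 1 and 2,
# and its floor still gives the row.
small, _ = quadratic_roots(d, 5)
print(math.floor(small))       # 1
```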
    

    If the index does not correspond to the leftmost non-zero cell, then we will not find an integer root, but we can take the floor of the smaller root as the row number.

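    import math

    def row_col_from_condensed_index(d, index):
        # b is the linear coefficient of i**2 + b*i + 2*index == 0
        b = 1 - 2*d
        # Floor of the smaller root is the (0-based) row
        i = math.floor((-b - math.sqrt(b**2 - 8*index))/2)
        # i*(b + i + 2) is always even, so integer division is exact
        j = index + i*(b + i + 2)//2 + 1
        return (i, j)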
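For very large d, the floating-point square root can in principle round the wrong way. The sketch below is an integer-only variant of the same idea, not the answer's own code: it assumes Python 3.8+ for `math.isqrt`, and the name `row_col_exact` is hypothetical. It relies on the identity floor((a - r)/2) == (a - ceil(r))//2 for an integer a.

```python
import math
from itertools import combinations

def row_col_exact(d, index):
    """Integer-only (row, col) lookup for a condensed index, 0-based."""
    a = 2*d - 1                            # equals -b in the answer's notation
    disc = a*a - 8*index                   # discriminant, at least 9 for valid indices
    ceil_root = math.isqrt(disc - 1) + 1   # ceil(sqrt(disc)) for disc >= 1
    i = (a - ceil_root)//2                 # floor of the smaller root
    j = index + i*(i + 3 - 2*d)//2 + 1     # same column formula, b + 2 == 3 - 2*d
    return (i, j)

# Compare against a brute-force enumeration of the upper triangle.
for d in (2, 3, 5, 50):
    expected = list(combinations(range(d), 2))
    assert [row_col_exact(d, k) for k in range(len(expected))] == expected
```

The loop at the end checks every condensed index for a few small dimensions against `itertools.combinations`, which enumerates the pairs in the same order.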
    

    If you don't know d, you can figure it from the length of the condensed matrix.

    ((d-1)*d)/2 == len(condensed_matrix)
    d = int((1 + math.sqrt(1 + 8*len(condensed_matrix)))/2)
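A minimal sketch of recovering d from the condensed length, with a check that the length is actually a valid triangular count. The function name `dimension_from_condensed_length` is an assumption made for this example, and `math.isqrt` requires Python 3.8+.

```python
import math

def dimension_from_condensed_length(m):
    """Number of observations d such that d*(d - 1)/2 == m."""
    d = (1 + math.isqrt(1 + 8*m))//2
    if d*(d - 1)//2 != m:
        raise ValueError(f"{m} is not a valid condensed-matrix length")
    return d

print(dimension_from_condensed_length(10))      # 5
print(dimension_from_condensed_length(499500))  # 1000
```

Rejecting invalid lengths up front avoids silently computing pairs for the wrong dimension.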
    
  • 2020-12-24 05:33

    Clearly, the function f you are searching for needs a second argument: the dimension of the matrix (in your case, 5).
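To illustrate why the dimension is needed, here is a small reference lookup that simply enumerates all pairs. The name `pair_by_enumeration` is hypothetical and the indices are 0-based; it is meant as a check, not as an efficient solution.

```python
from itertools import combinations

def pair_by_enumeration(dim, index):
    """Reference lookup: list every pair (i, j), i < j, and pick one by position."""
    return list(combinations(range(dim), 2))[index]

# The same condensed index names different pairs for different dimensions.
print(pair_by_enumeration(5, 4))  # (1, 2)
print(pair_by_enumeration(6, 4))  # (0, 5)
```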

    First Try:

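    def f(dim, i):
        d = dim - 1          # number of entries in the current row
        s = d                # condensed index just past the current row
        while i >= s:        # advance until the row containing i is reached
            d -= 1
            s += d
        row = dim - 1 - d
        return (row, i - (s - d) + row + 1)  # 0-based (row, column)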
    
  • 2020-12-24 05:49

    Here's another solution:

    import numpy as np
    from scipy.spatial.distance import squareform
    
    def f(c,n):
        tt = np.zeros_like(c)
        tt[n] = 1
        return tuple(np.nonzero(squareform(tt))[0])
    
  • 2020-12-24 05:49

    To improve efficiency, use numpy.triu_indices:

    import numpy

    def PdistIndices(n, I):
        '''Return the (i, j) index pairs for the pdist result indices in I.'''
        idx = numpy.array(numpy.triu_indices(n, 1)).T[I]
        return idx
    

    So I is an array of indices.
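A short usage sketch of the same `numpy.triu_indices` lookup for n = 5. The variable names are chosen for this example only. Note that `triu_indices` materialises all n*(n-1)/2 pairs, so memory use grows quadratically with n.

```python
import numpy as np

n = 5
I = np.array([0, 5, 9])              # condensed indices to look up

rows, cols = np.triu_indices(n, 1)   # all (row, col) pairs above the diagonal
pairs = np.column_stack((rows[I], cols[I]))
print(pairs.tolist())                # [[0, 1], [1, 3], [3, 4]]
```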

    However, a better solution is to implement an optimized brute-force search, say, in Fortran:

    function PdistIndices(n,indices,m) result(IJ)
        ! IJ = (i,j) index pairs for the selected pdist[python] results[indices]
        ! indices must be sorted in strictly increasing order
        implicit none
        integer:: i,j,m,n,k,w,indices(0:m-1),IJ(0:m-1,2)
        logical:: finished
        k = 0; w = 0; finished = .false.
        do i=0,n-2
            do j=i+1,n-1
                if (k==indices(w)) then
                    IJ(w,:) = [i,j]
                    w = w+1
                    if (w==m) then
                        finished = .true.
                        exit
                    endif
                endif
                k = k+1
            enddo
            if (finished) then
                exit
            endif
        enddo
    end function
    

    then compile using F2PY and enjoy unbeatable performance. ;)
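For readers who do not use Fortran, the same single-pass scan can be sketched in Python. This only illustrates the control flow and makes no performance claim. The name `pdist_indices_scan` is hypothetical, and the sketch assumes that `indices` is sorted in strictly increasing order, as the Fortran loop does.

```python
def pdist_indices_scan(n, indices):
    """Single pass over the upper triangle; `indices` must be strictly increasing."""
    result = []
    if not indices:
        return result
    k = 0   # running condensed index
    w = 0   # position of the next wanted entry in `indices`
    for i in range(n - 1):
        for j in range(i + 1, n):
            if k == indices[w]:
                result.append((i, j))
                w += 1
                if w == len(indices):
                    return result   # all requested indices found
            k += 1
    return result

print(pdist_indices_scan(5, [0, 5, 9]))  # [(0, 1), (1, 3), (3, 4)]
```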

  • 2020-12-24 05:56

    To complete the list of answers to this question: a fast, vectorized version of fgregg's answer (as suggested by David Marx) could look like this:

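    import numpy as np

    def vec_row_col(d, i):
        i = np.array(i)
        b = 1 - 2 * d
        # Floor of the smaller root of the quadratic gives the row
        x = np.floor((-b - np.sqrt(b**2 - 8*i))/2).astype(int)
        y = (i + x*(b + x + 2)/2 + 1).astype(int)
        if i.shape:
            return zip(x, y)  # a lazy iterator in Python 3
        else:
            return (x, y)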
    

    I needed to do these calculations for huge arrays, and the speedup as compared to the un-vectorized version (https://stackoverflow.com/a/14839010/3631440) is (as usual) quite impressive (using IPython %timeit):

    import numpy as np
    from scipy.spatial import distance
    
    test = np.random.rand(1000,1000)
    condense = distance.pdist(test)
    sample = np.random.randint(0,len(condense), 1000)
    
    %timeit res = vec_row_col(1000, sample)
    10000 loops, best of 3: 156 µs per loop
    
    res = []
    %timeit for i in sample: res.append(row_col_from_condensed_index(1000, i))
    100 loops, best of 3: 5.87 ms per loop
    

    That's about 37 times faster in this example!

  • 2020-12-24 05:57

    You may find triu_indices useful. Like,

    In []: from numpy import ceil, sqrt, triu_indices
    In []: ti= triu_indices(5, 1)
    In []: r, c= ti[0][5], ti[1][5]
    In []: r, c
    Out[]: (1, 3)
    

    Just note that indices start from 0. You may adjust this as you like, for example:

    In []: def f(n, c):
       ..:     n= int(ceil(sqrt(2* n)))  # triu_indices needs an integer size
       ..:     ti= triu_indices(n, 1)
       ..:     return ti[0][c]+ 1, ti[1][c]+ 1
       ..:
    In []: f(len(c), 5)
    Out[]: (2, 4)
    