For a set of observations:
[a1,a2,a3,a4,a5]
their pairwise distances
d = [[0,   a12, a13, a14, a15],
     [a21, 0,   a23, a24, a25],
     [a31,
-
The formula for an index of the condensed matrix is
index = d*(d-1)/2 - (d-i)*(d-i-1)/2 + j - i - 1
where i is the row index, j is the column index (with i < j, both counted from 0), and d is the row length of the original (d × d) upper triangular matrix.
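As a quick sanity check, the formula can be written as a small function and compared with a row-by-row walk through the strict upper triangle. This is only a sketch: the name condensed_index is made up for illustration, and it assumes 0-based i < j.

```python
from itertools import combinations

def condensed_index(d, i, j):
    """Condensed index of entry (i, j), 0-based, with i < j < d (hypothetical helper)."""
    if not 0 <= i < j < d:
        raise ValueError("need 0 <= i < j < d")
    # Each product is of two consecutive integers, so // 2 is exact.
    return d * (d - 1) // 2 - (d - i) * (d - i - 1) // 2 + j - i - 1

# combinations() walks the strict upper triangle row by row,
# which is the order the condensed matrix uses.
d = 5
pairs = list(combinations(range(d), 2))
assert [condensed_index(d, i, j) for i, j in pairs] == list(range(len(pairs)))
```

For d = 5, the entry (1, 3) lands at condensed index 5.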
Consider the case when the index refers to the leftmost, non-zero entry of some row in the original matrix. For all the leftmost indices,
j == i + 1
so
index = d*(d-1)/2 - (d-i)*(d-i-1)/2 + i + 1 - i - 1
index = d*(d-1)/2 - (d-i)*(d-i-1)/2
With some algebra, we can rewrite this as
i**2 + (1 - 2*d)*i + 2*index == 0
Then we can use the quadratic formula to find the roots of the equation. Both roots are non-negative, and we only care about the smaller one (the one with the minus sign in front of the square root).
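The skipped algebra is short. Written out with the same symbols (d, i, index), multiply the leftmost-entry equation by 2 and expand:

```latex
\begin{align*}
2\,\mathrm{index} &= d(d-1) - (d-i)(d-i-1) \\
                  &= (d^2 - d) - (d^2 - 2di + i^2 - d + i) \\
                  &= 2di - i^2 - i \\
\Longrightarrow\quad 0 &= i^2 + (1-2d)\,i + 2\,\mathrm{index} \\
i &= \frac{(2d-1) \pm \sqrt{(1-2d)^2 - 8\,\mathrm{index}}}{2}
\end{align*}
```

The two roots sum to 2d - 1, so the root with the plus sign always exceeds d - 1 and can never be a valid row. The root with the minus sign is the one used in the code further down.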
If this index does correspond to the leftmost non-zero cell of a row, then that root is a non-negative integer equal to the row number. Finding the column number is then just arithmetic:
j = index - d*(d-1)/2 + (d-i)*(d-i-1)/2 + i + 1
If the index does not correspond to the leftmost non-zero cell, the root is not an integer, but we can take its floor as the row number.
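import math

def row_col_from_condensed_index(d, index):
    b = 1 - 2 * d
    # floor of the smaller root of i**2 + b*i + 2*index == 0 gives the row
    i = math.floor((-b - math.sqrt(b**2 - 8 * index)) / 2)
    # i*(b + i + 2) is always even, so integer division is exact
    j = index + i * (b + i + 2) // 2 + 1
    return (i, j)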
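For very large d, math.sqrt rounds to double precision, and the floored root could in principle land on the wrong side of an integer. One possible integer-only variant is sketched below. The name row_col_exact is made up for illustration, and math.isqrt needs Python 3.8 or later. It is checked against a brute-force enumeration for small d.

```python
import math
from itertools import combinations

def row_col_exact(d, index):
    """Integer-only (row, col) for a 0-based condensed index (hypothetical helper)."""
    m = 2 * d - 1                 # equals -b in the function above
    disc = m * m - 8 * index      # discriminant, at least 9 for valid indices
    # floor((m - sqrt(disc)) / 2) == (m - ceil(sqrt(disc))) // 2,
    # and ceil(sqrt(disc)) == isqrt(disc - 1) + 1 for disc >= 1.
    i = (m - 1 - math.isqrt(disc - 1)) // 2
    j = index + i * (i + 3 - 2 * d) // 2 + 1
    return (i, j)

# Brute-force check: combinations() yields the pairs in condensed order.
for d in range(2, 60):
    for index, pair in enumerate(combinations(range(d), 2)):
        assert row_col_exact(d, index) == pair
```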
If you don't know d, you can figure it out from the length of the condensed matrix:
((d-1)*d)/2 == len(condensed_matrix)
d = int((1 + math.sqrt(1 + 8*len(condensed_matrix)))/2)
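A slightly more defensive version of the same inversion is sketched below. It uses integer arithmetic and rejects lengths that cannot come from a condensed matrix. The function name n_observations is an assumption made for this example, and math.isqrt needs Python 3.8 or later.

```python
import math

def n_observations(condensed_length):
    """Return d such that d*(d-1)/2 == condensed_length (hypothetical helper)."""
    d = (1 + math.isqrt(1 + 8 * condensed_length)) // 2
    # Only triangular numbers are valid condensed lengths.
    if d * (d - 1) // 2 != condensed_length:
        raise ValueError("not a valid condensed-matrix length")
    return d
```

With five observations the condensed matrix has 10 entries, so n_observations(10) gives back 5.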
-
Clearly, the function f you are searching for needs a second argument: the dimension of the matrix, in your case 5.
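First try:
def f(dim, i):
    d = dim - 1   # number of entries in the current row
    s = d         # number of condensed entries up to the end of the current row
    while i >= s:  # index lies beyond this row, so move down one row
        d -= 1
        s += d
    row = dim - 1 - d
    return (row, row + 1 + i - (s - d))  # 0-based (row, column)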
-
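Here's another solution:
import numpy as np
from scipy.spatial.distance import squareform

def f(c, n):
    # c is the condensed matrix, n the condensed index
    tt = np.zeros_like(c)
    tt[n] = 1  # mark the single entry of interest
    # squareform places the mark at (i, j) and (j, i); the rows of the two marks are i and j
    return tuple(np.nonzero(squareform(tt))[0])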
-
To improve the efficiency, use numpy.triu_indices:
import numpy

def PdistIndices(n, I):
    '''idx = {} indices for pdist results'''
    idx = numpy.array(numpy.triu_indices(n, 1)).T[I]
    return idx
Here I is an array of indices.
However, a better solution is to implement an optimized brute-force search, say, in Fortran (the routine assumes that indices is sorted in ascending order, without duplicates):
function PdistIndices(n,indices,m) result(IJ)
    !IJ = {} indices for pdist[python] selected results[indices]
    implicit none
    integer:: i,j,m,n,k,w,indices(0:m-1),IJ(0:m-1,2)
    logical:: finished
    k = 0; w = 0; finished = .false.
    do i=0,n-2
        do j=i+1,n-1
            if (k==indices(w)) then
                IJ(w,:) = [i,j]
                w = w+1
                if (w==m) then
                    finished = .true.
                    exit
                endif
            endif
            k = k+1
        enddo
        if (finished) then
            exit
        endif
    enddo
end function
Then compile it using F2PY and enjoy unbeatable performance. ;)
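For readers without a Fortran toolchain, the same single-pass scan can be sketched in plain Python. It is only meant to show the logic, not to match the compiled speed. The name pdist_indices_scan is made up for illustration, and like the Fortran routine it assumes the requested indices are sorted in ascending order with no duplicates.

```python
def pdist_indices_scan(n, indices):
    """(i, j) pairs for ascending, duplicate-free condensed indices (hypothetical name)."""
    result = []
    w = 0  # position in `indices` of the next index to resolve
    k = 0  # condensed index of the pair (i, j) currently visited
    for i in range(n - 1):
        for j in range(i + 1, n):
            if w == len(indices):
                return result  # everything resolved, stop early
            if k == indices[w]:
                result.append((i, j))
                w += 1
            k += 1
    return result
```

The whole batch is resolved in one walk over the upper triangle, so the cost depends on the largest requested index rather than on how many indices are requested.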
-
To complete the list of answers to this question: a fast, vectorized version of fgregg's answer (as suggested by David Marx) could look like this:
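import numpy as np

def vec_row_col(d, i):
    i = np.array(i)
    b = 1 - 2 * d
    # floor of the smaller root of the quadratic gives the row of every index at once
    x = np.floor((-b - np.sqrt(b**2 - 8 * i)) / 2).astype(int)
    # x*(b + x + 2) is always even, so integer division is exact
    y = (i + x * (b + x + 2) // 2 + 1).astype(int)
    if i.shape:
        return list(zip(x, y))  # array input: list of (row, col) pairs
    else:
        return (x, y)  # scalar input: a single (row, col) pair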
I needed to do these calculations for huge arrays, and the speedup as compared to the un-vectorized version (https://stackoverflow.com/a/14839010/3631440) is (as usual) quite impressive (using IPython %timeit):
import numpy as np
from scipy.spatial import distance
test = np.random.rand(1000,1000)
condense = distance.pdist(test)
sample = np.random.randint(0,len(condense), 1000)
%timeit res = vec_row_col(1000, sample)
10000 loops, best of 3: 156 µs per loop
res = []
%timeit for i in sample: res.append(row_col_from_condensed_index(1000, i))
100 loops, best of 3: 5.87 ms per loop
That's about 37 times faster in this example!
-
You may find numpy's triu_indices useful. For example:
In []: ti= triu_indices(5, 1)
In []: r, c= ti[0][5], ti[1][5]
In []: r, c
Out[]: (1, 3)
Note that the indices start from 0. You may adjust them as you like, for example (here n is the length of the condensed matrix and c is the condensed index):
In []: def f(n, c):
..:     n = int(ceil(sqrt(2 * n)))
..:     ti = triu_indices(n, 1)
..:     return ti[0][c] + 1, ti[1][c] + 1
..:
In []: f(len(c), 5)
Out[]: (2, 4)
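It may not be obvious why ceil(sqrt(2*n)) recovers the matrix dimension from the condensed length n. A short justification, writing d for the dimension and assuming d >= 2:

```latex
\[
n = \frac{d(d-1)}{2}
\;\Longrightarrow\;
(d-1)^2 < 2n = d(d-1) < d^2
\;\Longrightarrow\;
d-1 < \sqrt{2n} < d
\;\Longrightarrow\;
\left\lceil \sqrt{2n} \right\rceil = d
\]
```

For the five observations here, n = 10 and sqrt(20) is about 4.47, which rounds up to 5.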