R - How to get row & column subscripts of matched elements from a distance matrix

后端未结

关注

 2  1837

I have an integer vector vec1 and I am generating a distant matrix using dist function. I want to get the coordinates (row and column) of element o

R functions

It is easy to write vectorized R functions for those index transforms. We only need some care dealing with "out-of-bound" index, for which NA should be returned.

## 2D index to 1D index
f <- function (i, j, dist_obj) {
  if (!inherits(dist_obj, "dist")) stop("please provide a 'dist' object")
  n <- attr(dist_obj, "Size")
  valid <- (i >= 1) & (j >= 1) & (i > j) & (i <= n) & (j <= n)
  k <- (2 * n - j) * (j - 1) / 2 + (i - j)
  k[!valid] <- NA_real_
  k
  }

## 1D index to 2D index
finv <- function (k, dist_obj) {
  if (!inherits(dist_obj, "dist")) stop("please provide a 'dist' object")
  n <- attr(dist_obj, "Size")
  valid <- (k >= 1) & (k <= n * (n - 1) / 2)
  k_valid <- k[valid]
  j <- rep.int(NA_real_, length(k))
  j[valid] <- floor(((2 * n + 1) - sqrt((2 * n - 1) ^ 2 - 8 * (k_valid - 1))) / 2)
  i <- j + k - (2 * n - j) * (j - 1) / 2
  cbind(i, j)
  }

These functions are extremely cheap in memory usage, as they work with index instead of matrices.

Applying `finv` to your question

You can use

vec1 <- c(2,3,6,12,17)
distMatrix <- dist(vec1)

finv(which(distMatrix == 5), distMatrix)
#     i j
#[1,] 5 4

Generally speaking, a distance matrix contains floating point numbers. It is risky to use == to judge whether two floating point numbers are equal. Read Why are these numbers not equal? for more and possible strategies.

Alternative with `dist2mat`

Using the dist2mat function given in as.matrix on a distance object is extremely slow; how to make it faster?, we may use which(, arr.ind = TRUE).

library(Rcpp)
sourceCpp("dist2mat.cpp")
mat <- dist2mat(distMatrix, 128)
which(mat == 5, arr.ind = TRUE)
#  row col
#5   5   4
#4   4   5

Appendix: Markdown (needs MathJax support) for the picture

## 2D index to 1D index

The lower triangular looks like this: $$\begin{pmatrix} 0 & 0 & \cdots & 0\\ \times & 0 & \cdots & 0\\ \times & \times & \cdots & 0\\ \vdots & \vdots & \ddots & 0\\ \times & \times & \cdots & 0\end{pmatrix}$$ If the matrix is $n \times n$, then there are $(n - 1)$ elements ("$\times$") in the 1st column, and $(n - j)$ elements in the j<sup>th</sup> column. Thus, for element $(i,\  j)$ (with $i > j$, $j < n$) in the lower triangular, there are $$(n - 1) + \cdots (n - (j - 1)) = \frac{(2n - j)(j - 1)}{2}$$ "$\times$" in the previous $(j - 1)$ columns, and it is the $(i - j)$<sup>th</sup> "$\times$" in the $j$<sup>th</sup> column. So it is the $$\left\{\frac{(2n - j)(j - 1)}{2} + (i - j)\right\}^{\textit{th}}$$ "$\times$" in the lower triangular.

----

## 1D index to 2D index

Now for the $k$<sup>th</sup> "$\times$" in the lower triangular, how can we find its matrix index $(i,\ j)$? We take two steps: 1> find $j$;  2> obtain $i$ from $k$ and $j$.

The first "$\times$" of the $j$<sup>th</sup> column, i.e., $(j + 1,\ j)$, is the $\left\{\frac{(2n - j)(j - 1)}{2} + 1\right\}^{\textit{th}}$ "$\times$" of the lower triangular, thus $j$ is the maximum value such that $\frac{(2n - j)(j - 1)}{2} + 1 \leq k$. This is equivalent to finding the max $j$ so that $$j^2 - (2n + 1)j + 2(k + n - 1) \geq 0.$$ The LHS is a quadratic polynomial, and it is easy to see that the solution is the integer no larger than its first root (i.e., the root on the left side): $$j = \left\lfloor\frac{(2n + 1) - \sqrt{(2n-1)^2 - 8(k-1)}}{2}\right\rfloor.$$ Then $i$ can be obtained from $$i = j + k - \left\{\frac{(2n - j)(j - 1)}{2}\right\}.$$

0 讨论(0)

耶瑟儿～

2020-11-27 22:37
If the vector is not too large, the best way is probably to wrap the output of dist into as.matrix and to use which with the option arr.ind=TRUE. The only disadvantage of this standard method to retrieve the index numbers within a dist matrix is an increase of memory usage, which may become important in the case of very large vectors passed to dist. This is because the conversion of the lower triangular matrix returned by dist into a regular, dense matrix effectively doubles the amount of stored data.

An alternative consists in converting the dist object into a list, such that each column in the lower triangular matrix of dist represents one member of the list. The index number of the list members and the position of the elements within the list members can then be mapped to the column and row number of the dense N x N matrix, without generating the matrix.

Here is one possible implementation of this list-based approach:
```
distToList <- function(x) {
  idx <- sum(seq(length(x) - 1)) - rev(cumsum(seq(length(x) - 1))) + 1
  listDist <- unname(split(dist(x), cumsum(seq_along(dist(x)) %in% idx)))
  # http://stackoverflow.com/a/16358095/4770166
}
findDistPairs <- function(vec, theDist) {
  listDist <- distToList(vec)
  inList <- lapply(listDist, is.element, theDist)
  matchedCols <- which(sapply(inList, sum) > 0) 
  if (length(matchedCols) > 0) found <- TRUE else found <- FALSE
  if (found) {
    matchedRows <- sapply(matchedCols, function(x) which(inList[[x]]) + x ) 
  } else {matchedRows <- integer(length = 0)}
  matches <- cbind(col=rep(matchedCols, sapply(matchedRows,length)), 
                   row=unlist(matchedRows))
  return(matches)
}

vec1 <- c(2, 3, 6, 12, 17)
findDistPairs(vec1, 5)
#     col row
#[1,]   4   5
```
The parts of the code that might be somewhat unclear concern the mapping of the position of an entry within the list to a column / row value of the N x N matrix. While not trivial, these transformations are straightforward.

In a comment within the code I have pointed out an answer on StackOverflow which has been used here to split a vector into a list. The loops (sapply, lapply) should be unproblematic in terms of performance since their range is of order O(N). The memory usage of this code is largely determined by the storage of the list. This amount of memory should be similar to that of the dist object since both objects contain the same data.

The dist object is calculated and transformed into a list in the function distToList(). Because of the dist calculation, which is required in any case, this function could be time-consuming in the case of large vectors. If the goal is to find several pairs with different distance values, then it may be better to calculate listDist only once for a given vector and to store the resulting list, e.g., in the global environment.

Long story short

The usual way to treat such problems is simple and fast:
```
distMatrix <- as.matrix(dist(vec1)) * lower.tri(diag(vec1))
which(distMatrix == 5, arr.ind = TRUE)
#  row col
#5   5   4
```
I suggest using this method by default. More complicated solutions may become necessary in situations where memory limits are reached, i.e., in the case of very large vectors vec1. The list-based approach described above could then provide a remedy.
0 讨论(0)
发布评论:

提交评论
- 加载中...

R - How to get row & column subscripts of matched elements from a distance matrix

R functions

Applying finv to your question

Alternative with dist2mat

Appendix: Markdown (needs MathJax support) for the picture

Applying `finv` to your question

Alternative with `dist2mat`