A basic/common class in R is called "dist", and is a relatively efficient representation of a symmetric distance matrix. Unlike a "matrix" object, however, there do not seem to be tools in the stats package for accessing its elements by index pairs. Thanks to @flodel for an alternative implementation in a non-core package.
I dug into the definition of the "dist" class in the core R source, which is old-school S3, and the dist.R source file contains no tools like what I'm asking about in this question.
The documentation of the dist() function does point out, usefully, that (and I quote):
The lower triangle of the distance matrix stored by columns in a vector, say do. If n is the number of observations, i.e., n <- attr(do, "Size"), then for i < j ≤ n, the dissimilarity between (row) i and j is:
do[n*(i-1) - i*(i-1)/2 + j-i]
The length of the vector is n*(n-1)/2, i.e., of order n^2.
(end quote)
I took advantage of this in the following example code for a define-yourself "dist" accessor. Note that this example can only return one value at a time.
################################################################################
# Define dist accessor
################################################################################
setOldClass("dist")
getDistIndex <- function(x, i, j){
  n <- attr(x, "Size")
  # translate labels into integer positions
  if( is.character(i) ){ i <- which(i[1] == attr(x, "Labels")) }
  if( is.character(j) ){ j <- which(j[1] == attr(x, "Labels")) }
  # switch indices (symmetric) if i is bigger than j
  if( i > j ){
    i0 <- i
    i <- j
    j <- i0
  }
  # for i < j <= n
  return( n*(i-1) - i*(i-1)/2 + j-i )
}
# Define the accessor
"[.dist" <- function(x, i, j, ...){
  x[[getDistIndex(x, i, j)]]
}
################################################################################
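Since the accessor above returns one value at a time, here is a minimal sketch of a vectorised lookup built on the same index formula. The name dist_get and the example data are my own (hypothetical, not part of stats), and it assumes i and j are positive integers or labels that exist in the object. It reads from unclass(x) so that it does not depend on any "[.dist" method being defined.

```r
# Hypothetical helper (not part of stats): vectorised lookup of the
# distances d(i[k], j[k]) stored in a "dist" object.
dist_get <- function(x, i, j) {
  n <- attr(x, "Size")
  labels <- attr(x, "Labels")
  # translate labels into integer positions
  if (is.character(i)) i <- match(i, labels)
  if (is.character(j)) j <- match(j, labels)
  lo <- pmin(i, j)                 # the matrix is symmetric, so order the pair
  hi <- pmax(i, j)
  k <- n * (lo - 1) - lo * (lo - 1) / 2 + hi - lo
  out <- numeric(length(k))        # diagonal entries (i == j) stay 0
  off <- lo != hi
  out[off] <- unclass(x)[k[off]]   # plain vector indexing, no S3 dispatch
  out
}

# Made-up example data: 5 points in 2 dimensions
pts <- matrix(rnorm(10), ncol = 2)
d <- dist(pts)
dist_get(d, c(1, 4, 3), c(2, 2, 3))
```

The result should agree with the corresponding entries of as.matrix(d), without ever building the full matrix.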
And this seems to work fine, as expected. However, I'm having trouble getting the replacement function to work.
################################################################################
# Define the replacement function
################################################################################
"[.dist<-" <- function(x, i, j, value){
x[[get.dist.index(x, i, j)]] <- value
return(x)
}
################################################################################
A test-run of this new assignment operator
dist1["5", "3"] <- 7000
Returns:
"R> Error in dist1["5", "3"] <- 7000 : incorrect number of subscripts on matrix"
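A likely cause of this error is the method name: the generic for bracket assignment is `[<-`, so an S3 method for class "dist" has to be called `[<-.dist`, not `[.dist<-`. With the name used above, no method is found and the default `[<-` is called with two subscripts on a plain vector. The body also calls get.dist.index, while the helper defined earlier is getDistIndex. Below is a minimal sketch of a working version. The names dist_index and pts are hypothetical, the data is made up, and the single-index branch is my own addition so that ordinary vector-style assignment keeps working.

```r
# Hypothetical helper (not part of stats): position of the pair (i, j)
# in the underlying vector of a "dist" object.
dist_index <- function(x, i, j) {
  n <- attr(x, "Size")
  labels <- attr(x, "Labels")
  if (is.character(i)) i <- match(i[1], labels)
  if (is.character(j)) j <- match(j[1], labels)
  stopifnot(!is.na(i), !is.na(j), i != j)
  lo <- min(i, j)
  hi <- max(i, j)
  n * (lo - 1) - lo * (lo - 1) / 2 + hi - lo
}

# The generic being extended is `[<-`, so the S3 method is `[<-.dist`.
"[<-.dist" <- function(x, i, j, value) {
  if (missing(j)) {
    # Single-index assignment: behave like a plain vector.
    cls <- oldClass(x)
    x <- unclass(x)
    x[i] <- value
    class(x) <- cls
    return(x)
  }
  x[[dist_index(x, i, j)]] <- value
  x
}

# Small example with row labels "1".."5" (made-up data)
pts <- matrix(c(0, 0, 3, 4, 6, 8, 0, 1, 1, 1), ncol = 2, byrow = TRUE,
              dimnames = list(as.character(1:5), NULL))
dist1 <- dist(pts)
dist1["5", "3"] <- 7000
as.matrix(dist1)["5", "3"]
```

Because the object keeps its class and attributes, as.matrix() still works afterwards and shows the new value at both ["5", "3"] and ["3", "5"].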
As-asked, I think @flodel answered the question better, but still thought this "answer" might also be useful.
I also found some nice S4 examples of square-bracket accessor and replacement definitions in the Matrix package, which could be adapted from this current example pretty easily.
I don't have a straight answer to your question, but if you are using the Euclidean distance, have a look at the rdist function from the fields package. Its implementation (in Fortran) is faster than dist, and the output is of class matrix. At the very least, it shows that some developers have chosen to move away from this dist class, maybe for the exact reason you are mentioning. If you are concerned that using a full matrix for storing a symmetric matrix is an inefficient use of memory, you could convert it to a triangular matrix.
library("fields")
points <- matrix(runif(1000*100), nrow=1000, ncol=100)
system.time(dist1 <- dist(points))
# user system elapsed
# 7.277 0.000 7.338
system.time(dist2 <- rdist(points))
# user system elapsed
# 2.756 0.060 2.851
class(dist2)
# [1] "matrix"
dim(dist2)
# [1] 1000 1000
dist2[1:3, 1:3]
# [,1] [,2] [,3]
# [1,] 0.0000000001 3.9529674733 3.8051198575
# [2,] 3.9529674733 0.0000000001 3.6552146293
# [3,] 3.8051198575 3.6552146293 0.0000000001
You may find this useful [from ?dist]:
The lower triangle of the distance matrix stored by columns in a vector, say ‘do’. If ‘n’ is the number of observations, i.e., ‘n <- attr(do, "Size")’, then for i < j <= n, the dissimilarity between (row) i and j is ‘do[n*(i-1) - i*(i-1)/2 + j-i]’. The length of the vector is n*(n-1)/2, i.e., of order n^2.
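In case the formula looks arbitrary, here is a short derivation sketch. It only assumes the column-wise storage described in the quote: column c of the lower triangle holds the n - c entries for rows c+1, ..., n. The symbol k for the position in the vector is my own notation.

```latex
% Entries stored before column i (columns 1, ..., i-1):
\sum_{c=1}^{i-1} (n - c) \;=\; n(i-1) - \frac{i(i-1)}{2}
% Within column i, row j (with j > i) is entry number j - i, so
k \;=\; n(i-1) - \frac{i(i-1)}{2} + (j - i), \qquad 1 \le i < j \le n .
% Check: (i, j) = (1, 2) gives k = 1, and (i, j) = (n-1, n) gives k = n(n-1)/2.
```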
There aren't standard ways of doing this, unfortunately. Here are two functions that convert between the 1D index and the 2D matrix coordinates. They aren't pretty, but they work, and at least you can use the code to make something nicer if you need it. I'm posting it just because the equations aren't obvious.
# given row, column, and n, return index
distdex <- function(i, j, n) {
  n*(i-1) - i*(i-1)/2 + j-i
}
# given index, return row and column
rowcol <- function(ix, n) {
  nr <- ceiling(n - (1 + sqrt(1 + 4*(n^2 - n - 2*ix)))/2)
  nc <- n - (2*n - nr + 1)*nr/2 + ix + nr
  cbind(nr, nc)
}
A little test harness to show it works:
testd <- dist(rnorm(20))
as.matrix(testd)[7,13] #row<col
distdex(7,13,20) # =105
testd[105] #same as above
testd[c(42,119)]
rowcol(c(42,119),20) # = (3,8) and (8,15)
as.matrix(testd)[3,8]
as.matrix(testd)[8,15]
Converting to a matrix was also out of the question for me, because the resulting matrix would be 35K by 35K, so I left it as a vector (the result of dist) and wrote a function to find the place in the vector where the distance should be:
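distXY <- function(X, Y, n){
  A <- min(X, Y)
  B <- max(X, Y)
  # (A-1)*n - (1 + 2 + ... + (A-1)) + B - A, with the sum in closed form;
  # building the sum from 1:(A-1) gives the wrong position when A is 1
  d <- (A-1)*n - A*(A-1)/2 + B - A
  return(d)
}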
Here X and Y are the row numbers of the two elements in the original matrix from which you calculated dist, and n is the total number of rows (observations) in that matrix. The result is the position in the dist vector where their distance is stored. I hope it makes sense.
The disto package provides a class that wraps distance matrices in R (in-memory and out-of-core) and provides much more than convenience operators like [. Please check the vignette here.
PS: I am the author of the package.