How do I manipulate/access elements of an instance of “dist” class using core R?

傲寒 2021-02-02 10:50

A basic/common class in R is called "dist", and is a relatively efficient representation of a symmetric distance matrix. Unlike a "matrix" object,

12 Answers
  • 2021-02-02 11:14

    There do not seem to be tools in the stats package for this. Thanks to @flodel for an alternative implementation in a non-core package.

    I dug into the definition of the "dist" class in the core R source. It is old-school S3, and the dist.R source file contains no accessor tools of the kind I'm asking about in this question.

    The documentation of the dist() function does point out, usefully, that (and I quote):

    The lower triangle of the distance matrix stored by columns in a vector, say do. If n is the number of observations, i.e., n <- attr(do, "Size"), then for i < j ≤ n, the dissimilarity between (row) i and j is:

    do[n*(i-1) - i*(i-1)/2 + j-i]

    The length of the vector is n*(n-1)/2, i.e., of order n^2.

    (end quote)
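
    As a quick sanity check of the quoted formula, here is a minimal sketch that compares it against as.matrix() on a small random data set. The variable names (m, d, full, pairs, k) are made up for this illustration and are not part of the documentation.

    ```r
    # Small reproducible data set: 6 observations, 3 variables
    set.seed(1)
    m    <- matrix(rnorm(18), nrow = 6)
    d    <- dist(m)
    full <- as.matrix(d)
    n    <- attr(d, "Size")

    # Positions in the lower triangle, listed column by column.
    # In the notation of the help page (i < j): i is the column, j is the row.
    pairs <- which(lower.tri(full), arr.ind = TRUE)
    i <- pairs[, "col"]
    j <- pairs[, "row"]

    # Index formula quoted from the dist() documentation
    k <- n * (i - 1) - i * (i - 1) / 2 + j - i

    # The values picked out of the dist vector should match the full matrix
    formula_ok <- isTRUE(all.equal(as.vector(d)[k], unname(full[cbind(j, i)])))
    formula_ok
    ```

    Because the lower triangle is stored by columns, k here should simply enumerate the dist vector from its first element to its last.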

    I took advantage of this in the following example code for a define-yourself "dist" accessor. Note that this example can only return one value at a time.

    ################################################################################
    # Define dist accessor
    ################################################################################
    setOldClass("dist")
    getDistIndex <- function(x, i, j){
        n <- attr(x, "Size")
        # is.character() is safer than comparing class(), which can have length > 1
        if( is.character(i) ){ i <- which(i[1] == attr(x, "Labels")) }
        if( is.character(j) ){ j <- which(j[1] == attr(x, "Labels")) }
        # switch indices (symmetric) if i is bigger than j
        if( i > j ){
            i0 <- i
            i  <- j
            j  <- i0
        }
        # for i < j <= n
        return( n*(i-1) - i*(i-1)/2 + j-i )
    }
    # Define the accessor
    "[.dist" <- function(x, i, j, ...){
        x[[getDistIndex(x, i, j)]]
    }
    ################################################################################
    

    And this seems to work fine, as expected. However, I'm having trouble getting the replacement function to work.

    ################################################################################
    # Define the replacement function
    ################################################################################
    "[.dist<-" <- function(x, i, j, value){
        x[[getDistIndex(x, i, j)]] <- value
        return(x)
    }
    ################################################################################
    

    A test-run of this new assignment operator

    dist1["5", "3"] <- 7000
    

    Returns:

    "R> Error in dist1["5", "3"] <- 7000 : incorrect number of subscripts on matrix"
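
    A possible explanation for the error, offered as a sketch rather than a definitive fix: S3 replacement methods are named after the generic `[<-`, so the method for class "dist" would be called `[<-.dist`. A function named `[.dist<-` is never dispatched, and R falls back to the default two-subscript assignment, which fails on a plain vector. The version below assumes the dist object carries row labels, and it repeats the index helper so that it runs on its own.

    ```r
    # Index helper, repeated here so the sketch is self-contained
    getDistIndex <- function(x, i, j) {
      n <- attr(x, "Size")
      if (is.character(i)) i <- match(i[1], attr(x, "Labels"))
      if (is.character(j)) j <- match(j[1], attr(x, "Labels"))
      # symmetric: make sure i < j before applying the formula
      if (i > j) { tmp <- i; i <- j; j <- tmp }
      n * (i - 1) - i * (i - 1) / 2 + j - i
    }

    # Replacement method: note the name is "[<-" followed by ".dist"
    "[<-.dist" <- function(x, i, j, value) {
      k   <- getDistIndex(x, i, j)
      cls <- oldClass(x)
      v   <- unclass(x)   # work on the bare vector; other attributes are kept
      v[k] <- value
      class(v) <- cls
      v
    }

    # Example data with row names, so that "5" and "3" are valid labels
    set.seed(1)
    m <- matrix(rnorm(15), nrow = 5, dimnames = list(as.character(1:5), NULL))
    dist1 <- dist(m)

    dist1["5", "3"] <- 7000
    as.matrix(dist1)["5", "3"]
    ```

    Only a single (i, j) pair is handled, as in the accessor above; a vectorised version would need more care.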

    As-asked, I think @flodel answered the question better, but still thought this "answer" might also be useful.

    I also found some nice S4 examples of square-bracket accessor and replacement definitions in the Matrix package, which could be adapted from this current example pretty easily.

  • 2021-02-02 11:15

    I don't have a straight answer to your question, but if you are using the Euclidean distance, have a look at the rdist function from the fields package. Its implementation (in Fortran) is faster than dist, and the output is of class matrix. At the very least, it shows that some developers have chosen to move away from this dist class, maybe for the exact reason you are mentioning. If you are concerned that using a full matrix for storing a symmetric matrix is an inefficient use of memory, you could convert it to a triangular matrix.

    library("fields")
    points <- matrix(runif(1000*100), nrow=1000, ncol=100)
    
    system.time(dist1 <- dist(points))
    #    user  system elapsed 
    #   7.277   0.000   7.338 
    
    system.time(dist2 <- rdist(points))
    #   user  system elapsed 
    #  2.756   0.060   2.851 
    
    class(dist2)
    # [1] "matrix"
    dim(dist2)
    # [1] 1000 1000
    dist2[1:3, 1:3]
    #              [,1]         [,2]         [,3]
    # [1,] 0.0000000001 3.9529674733 3.8051198575
    # [2,] 3.9529674733 0.0000000001 3.6552146293
    # [3,] 3.8051198575 3.6552146293 0.0000000001
    
  • 2021-02-02 11:18

    You may find this useful [from ?dist]:

    The lower triangle of the distance matrix stored by columns in a vector, say ‘do’. If ‘n’ is the number of observations, i.e., ‘n <- attr(do, "Size")’, then for i < j <= n, the dissimilarity between (row) i and j is ‘do[n*(i-1) - i*(i-1)/2 + j-i]’. The length of the vector is n*(n-1)/2, i.e., of order n^2.

  • 2021-02-02 11:20

    There aren't standard ways of doing this, unfortunately. Here are two functions that convert between the 1D index and the 2D matrix coordinates. They aren't pretty, but they work, and at least you can use the code to make something nicer if you need it. I'm posting it just because the equations aren't obvious.

    distdex<-function(i,j,n) #given row, column, and n, return index
        n*(i-1) - i*(i-1)/2 + j-i
    
    rowcol<-function(ix,n) { #given index, return row and column
        nr=ceiling(n-(1+sqrt(1+4*(n^2-n-2*ix)))/2)
        nc=n-(2*n-nr+1)*nr/2+ix+nr
        cbind(nr,nc)
    }
    

    A little test harness to show it works:

    dist(rnorm(20))->testd
    as.matrix(testd)[7,13]   #row<col
    distdex(7,13,20) # =105
    testd[105]   #same as above
    
    testd[c(42,119)]
    rowcol(c(42,119),20)  # = (3,8) and (8,15)
    as.matrix(testd)[3,8]
    as.matrix(testd)[8,15]
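
    A further check, sketched here under the assumption that the two functions are meant to be exact inverses of each other: run every valid index through rowcol() and back through distdex(). Both functions are repeated from above so the block runs on its own; rc and roundtrip_ok are names made up for this illustration.

    ```r
    # Functions repeated from the answer above
    distdex <- function(i, j, n) n * (i - 1) - i * (i - 1) / 2 + j - i

    rowcol <- function(ix, n) {
      nr <- ceiling(n - (1 + sqrt(1 + 4 * (n^2 - n - 2 * ix))) / 2)
      nc <- n - (2 * n - nr + 1) * nr / 2 + ix + nr
      cbind(nr, nc)
    }

    n  <- 20
    ix <- seq_len(n * (n - 1) / 2)   # every valid position in the dist vector
    rc <- rowcol(ix, n)

    # Index -> (row, col) -> index should give back the original index,
    # and every recovered pair should satisfy row < col
    roundtrip_ok <- all(distdex(rc[, "nr"], rc[, "nc"], n) == ix) &&
                    all(rc[, "nr"] < rc[, "nc"])
    roundtrip_ok
    ```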
    
  • 2021-02-02 11:26

    Converting to a matrix was also out of question for me, because the resulting matrix would be 35K by 35K, so I left it as a vector (result of dist) and wrote a function to find the place in the vector where the distance should be:

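    distXY <- function(X, Y, n){
      A <- min(X, Y)
      B <- max(X, Y)
    
      # (A-1)*n - (1 + 2 + ... + (A-1)) + B - A, with the sum written in
      # closed form as A*(A-1)/2; this also handles A == 1 correctly
      d <- (A - 1) * n - A * (A - 1) / 2 + B - A
    
      return(d)
    
    }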
    

    Here X and Y are the row numbers, in the matrix from which you calculated dist, of the two elements being compared, and n is the number of rows (observations) in that matrix. The result is the position in the dist vector where their distance is stored.
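
    Since the subtracted terms 1 + 2 + ... + (A-1) add up to A*(A-1)/2, the position can also be computed in closed form for many pairs at once. The following is a minimal sketch, not a tested replacement: dist_pos is a hypothetical name, and the small 10-row matrix stands in for the real 35K-row data.

    ```r
    # dist_pos: hypothetical vectorised helper (X, Y are row numbers, n the row count)
    dist_pos <- function(X, Y, n) {
      A <- pmin(X, Y)
      B <- pmax(X, Y)
      # the dist vector has no entry for a row paired with itself
      stopifnot(all(A >= 1), all(B <= n), all(A < B))
      (A - 1) * n - A * (A - 1) / 2 + B - A
    }

    set.seed(42)
    m    <- matrix(rnorm(40), nrow = 10)
    d    <- dist(m)
    full <- as.matrix(d)   # only feasible here because the example is tiny

    X <- c(1, 7, 10)
    Y <- c(4, 2, 9)
    pos <- dist_pos(X, Y, n = 10)

    # Distances read straight from the dist vector, without the full matrix
    as.vector(d)[pos]
    ```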

  • 2021-02-02 11:26

    The disto package provides a class that wraps distance matrices in R (in-memory and out-of-core) and offers much more than convenience operators like [. Please check the vignette here.

    PS: I am the author of the package.
