as.matrix on a distance object is extremely slow; how to make it faster?

拟墨画扇 提交于 2019-11-28 10:01:31

问题


I found an R package Rlof which uses multithreading to calculate distance matrices and it does a wonderful job.

However, the output of the function distmc is a vector rather than a matrix. Applying as.matrix to this "dist" object turns out much more expensive than the multi-threaded calculation of distances.

Looking at the function help, the options for printing the diagonal and upper triangle are there, but I don't understand where they are supposed to be used.

Is there a way of saving the time of as.matrix somehow?

Reproducible example:

set.seed(42)
M1 = matrix(rnorm(15000*20), nrow = 15000, ncol =20)
system.time({dA = distmc(M1, method = "euclidean", diag = TRUE,
                         upper = TRUE, p = 2)})
system.time(A = as.matrix(dA))

回答1:


What does dist return?

This function always returns a vector, holding the lower triangular part (by column) of the full matrix. The diag or upper argument only affects printing, i.e., stats:::print.dist. This function can print this vector as if it were a matrix; but it is really not.


Why is as.matrix bad for a "dist" object?

It is hard to efficiently work with triangular matrices or to further make them symmetric in R core. lower.tri and upper.tri can be memory consuming if your matrix is large: R: Convert upper triangular part of a matrix to symmetric matrix.

Converting a "dist" object to a matrix is worse. Look at the code of stats:::as.matrix.dist:

function (x, ...) 
{
    size <- attr(x, "Size")
    df <- matrix(0, size, size)
    df[row(df) > col(df)] <- x
    df <- df + t(df)
    labels <- attr(x, "Labels")
    dimnames(df) <- if (is.null(labels)) 
    list(seq_len(size), seq_len(size))
    else list(labels, labels)
    df
}

The use of row, col and t are a nightmare. And the final "dimnames<-" generates another big temporary matrix object. When memory becomes a bottleneck, everything is slow.


But we still possibly need a full matrix, as it is easy to use.

The awkward thing is that it is easier to work with a full matrix so we want it. Consider this example: R - How to get row & column subscripts of matched elements from a distance matrix. Operations are tricky if we try to avoid forming the complete matrix.


An Rcpp solution

I write an Rcpp function dist2mat (see "dist2mat.cpp" source file in the end of this answer).

The function has two inputs: a "dist" object x and a (integer) cache blocking factor bf. The function works by first creating a matrix and fill in its lower triangular part, then copying the lower triangular part to upper triangular to make it symmetric. The second step is a typical transposition operation and for large matrix cache blocking is worthwhile. Performance should be insensitive to the cache blocking factor as long as it is not too small or too large. 128 or 256 is generally a good choice.

This is my first try with Rcpp. I have been a C programmer using R's conventional C interface. But I am on my way to get familiar with Rcpp as well. Given that you don't know how to write compiled code, your probably also don't know how to run Rcpp functions. You need to

  1. install Rcpp package (not sure if you further need Rtools if you are on Windows);
  2. copy my "dist2mat.cpp" into a file under your R's current working directory (you can get it from getwd() in your R session). A ".cpp" file is just a plain text file so you can create, edit and save it with any text editor.

Now let's start the showcase.

library(Rcpp)
sourceCpp("dist2mat.cpp")  ## this takes some time; be patient

## a simple test with `dist2mat`
set.seed(0)
x <- dist(matrix(runif(10), nrow = 5, dimnames = list(letters[1:5], NULL)))
A <- dist2mat(x, 128)  ## cache blocking factor = 128
A
#          a         b         c         d         e
#a 0.0000000 0.9401067 0.9095143 0.5618382 0.4275871
#b 0.9401067 0.0000000 0.1162289 0.3884722 0.6968296
#c 0.9095143 0.1162289 0.0000000 0.3476762 0.6220650
#d 0.5618382 0.3884722 0.3476762 0.0000000 0.3368478
#e 0.4275871 0.6968296 0.6220650 0.3368478 0.0000000

The resulting matrix preserves the row names of your original matrix / data frame passed to dist.

You can tune the cache blocking factor on your machine. Note that the effects of cache blocking is not evident for small matrices. Here I tried a 10000 x 10000 one.

## mimic a "dist" object without actually calling `dist`
n <- 10000
x <- structure(numeric(n * (n - 1) / 2), class = "dist", Size = n)

system.time(A <- dist2mat(x, 64))
#   user  system elapsed 
#  0.676   0.424   1.113 

system.time(A <- dist2mat(x, 128))
#   user  system elapsed 
#  0.532   0.140   0.672 

system.time(A <- dist2mat(x, 256))
#   user  system elapsed 
#  0.616   0.140   0.759 

We can benchmark dist2mat with as.matrix. As as.matrix is RAM consuming, I use a small example here.

## mimic a "dist" object without actually calling `dist`
n <- 2000
x <- structure(numeric(n * (n - 1) / 2), class = "dist", Size = n)

library(bench)
bench::mark(dist2mat(x, 128), as.matrix(x), check = FALSE)
## A tibble: 2 x 14
#  expression         min   mean  median     max `itr/sec` mem_alloc  n_gc n_itr
#  <chr>         <bch:tm> <bch:> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl> <int>
#1 dist2mat(x, …   24.6ms   26ms  25.8ms  37.1ms     38.4     30.5MB     0    20
#2 as.matrix(x)   154.5ms  155ms 154.8ms 154.9ms      6.46   160.3MB     0     4
## ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
##   time <list>, gc <list>

Note how dist2mat is faster (see "mean", "median"), and also how less RAM (see "mem_alloc") it needs. I've set check = FALSE to disable result checking, because dist2mat returns no "dimnames" attribute (the "dist" object has no such info) but as.matrix does (it sets 1:2000 as "dimnames"), so they are not exactly equal. But you can verify that they are both correct.

A <- dist2mat(x, 128)
B <- as.matrix(x)
range(A - B)
#[1] 0 0

"dist2mat.cpp"

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix dist2mat(NumericVector& x, int bf) {

  /* input validation */
  if (!x.inherits("dist")) stop("Input must be a 'dist' object");
  int n = x.attr("Size");
  if (n > 65536) stop("R cannot create a square matrix larger than 65536 x 65536");

  /* initialization */
  NumericMatrix A(n, n);

  /* use pointers */
  size_t j, i, jj, ni, nj; double *ptr_x = &x[0];
  double *A_jj, *A_ij, *A_ji, *col, *row, *end;

  /* fill in lower triangular part */
  for (j = 0; j < n; j++) {
    col = &A(j + 1, j); end = &A(n, j);
    while (col < end) *col++ = *ptr_x++;
    }

  /* cache blocking factor */
  size_t b = (size_t)bf;

  /* copy lower triangular to upper triangular; cache blocking applied */
  for (j = 0; j < n; j += b) {
    nj = n - j; if (nj > b) nj = b;
    /* diagonal block has size nj x nj */
    A_jj = &A(j, j);
    for (jj = nj - 1; jj > 0; jj--, A_jj += n + 1) {
      /* copy a column segment to a row segment */
      col = A_jj + 1; row = A_jj + n;
      for (end = col + jj; col < end; col++, row += n) *row = *col;
      }
    /* off-diagonal blocks */
    for (i = j + nj; i < n; i += b) {
      ni = n - i; if (ni > b) ni = b;
      /* off-diagonal block has size ni x nj */
      A_ij = &A(i, j); A_ji = &A(j, i);
      for (jj = 0; jj < nj; jj++) {
        /* copy a column segment to a row segment */
        col = A_ij + jj * n; row = A_ji + jj;
        for (end = col + ni; col < end; col++, row += n) *row = *col;
        }
      }
    }

  /* add row names and column names */
  A.attr("dimnames") = List::create(x.attr("Labels"), x.attr("Labels"));

  return A;
  }


来源:https://stackoverflow.com/questions/51707582/as-matrix-on-a-distance-object-is-extremely-slow-how-to-make-it-faster

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!