问题
I found an R package Rlof
which uses multithreading to calculate distance matrices and it does a wonderful job.
However, the output of the function distmc
is a vector rather than a matrix. Applying as.matrix
to this "dist" object turns out much more expensive than the multi-threaded calculation of distances.
Looking at the function help, the options for printing the diagonal and upper triangle are there, but I don't understand where they are supposed to be used.
Is there a way of saving the time of as.matrix
somehow?
Reproducible example:
set.seed(42)
M1 = matrix(rnorm(15000*20), nrow = 15000, ncol =20)
system.time({dA = distmc(M1, method = "euclidean", diag = TRUE,
upper = TRUE, p = 2)})
system.time(A = as.matrix(dA))
回答1:
What does dist
return?
This function always returns a vector, holding the lower triangular part (by column) of the full matrix. The diag
or upper
argument only affects printing, i.e., stats:::print.dist
. This function can print this vector as if it were a matrix; but it is really not.
Why is as.matrix
bad for a "dist" object?
It is hard to efficiently work with triangular matrices or to further make them symmetric in R core. lower.tri
and upper.tri
can be memory consuming if your matrix is large: R: Convert upper triangular part of a matrix to symmetric matrix.
Converting a "dist" object to a matrix is worse. Look at the code of stats:::as.matrix.dist
:
function (x, ...)
{
size <- attr(x, "Size")
df <- matrix(0, size, size)
df[row(df) > col(df)] <- x
df <- df + t(df)
labels <- attr(x, "Labels")
dimnames(df) <- if (is.null(labels))
list(seq_len(size), seq_len(size))
else list(labels, labels)
df
}
The use of row
, col
and t
are a nightmare. And the final "dimnames<-"
generates another big temporary matrix object. When memory becomes a bottleneck, everything is slow.
But we still possibly need a full matrix, as it is easy to use.
The awkward thing is that it is easier to work with a full matrix so we want it. Consider this example: R - How to get row & column subscripts of matched elements from a distance matrix. Operations are tricky if we try to avoid forming the complete matrix.
An Rcpp solution
I write an Rcpp function dist2mat
(see "dist2mat.cpp" source file in the end of this answer).
The function has two inputs: a "dist" object x
and a (integer) cache blocking factor bf
. The function works by first creating a matrix and fill in its lower triangular part, then copying the lower triangular part to upper triangular to make it symmetric. The second step is a typical transposition operation and for large matrix cache blocking is worthwhile. Performance should be insensitive to the cache blocking factor as long as it is not too small or too large. 128 or 256 is generally a good choice.
This is my first try with Rcpp. I have been a C programmer using R's conventional C interface. But I am on my way to get familiar with Rcpp as well. Given that you don't know how to write compiled code, your probably also don't know how to run Rcpp functions. You need to
- install
Rcpp
package (not sure if you further need Rtools if you are on Windows); - copy my "dist2mat.cpp" into a file under your R's current working directory (you can get it from
getwd()
in your R session). A ".cpp" file is just a plain text file so you can create, edit and save it with any text editor.
Now let's start the showcase.
library(Rcpp)
sourceCpp("dist2mat.cpp") ## this takes some time; be patient
## a simple test with `dist2mat`
set.seed(0)
x <- dist(matrix(runif(10), nrow = 5, dimnames = list(letters[1:5], NULL)))
A <- dist2mat(x, 128) ## cache blocking factor = 128
A
# a b c d e
#a 0.0000000 0.9401067 0.9095143 0.5618382 0.4275871
#b 0.9401067 0.0000000 0.1162289 0.3884722 0.6968296
#c 0.9095143 0.1162289 0.0000000 0.3476762 0.6220650
#d 0.5618382 0.3884722 0.3476762 0.0000000 0.3368478
#e 0.4275871 0.6968296 0.6220650 0.3368478 0.0000000
The resulting matrix preserves the row names of your original matrix / data frame passed to dist
.
You can tune the cache blocking factor on your machine. Note that the effects of cache blocking is not evident for small matrices. Here I tried a 10000 x 10000 one.
## mimic a "dist" object without actually calling `dist`
n <- 10000
x <- structure(numeric(n * (n - 1) / 2), class = "dist", Size = n)
system.time(A <- dist2mat(x, 64))
# user system elapsed
# 0.676 0.424 1.113
system.time(A <- dist2mat(x, 128))
# user system elapsed
# 0.532 0.140 0.672
system.time(A <- dist2mat(x, 256))
# user system elapsed
# 0.616 0.140 0.759
We can benchmark dist2mat
with as.matrix
. As as.matrix
is RAM consuming, I use a small example here.
## mimic a "dist" object without actually calling `dist`
n <- 2000
x <- structure(numeric(n * (n - 1) / 2), class = "dist", Size = n)
library(bench)
bench::mark(dist2mat(x, 128), as.matrix(x), check = FALSE)
## A tibble: 2 x 14
# expression min mean median max `itr/sec` mem_alloc n_gc n_itr
# <chr> <bch:tm> <bch:> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int>
#1 dist2mat(x, … 24.6ms 26ms 25.8ms 37.1ms 38.4 30.5MB 0 20
#2 as.matrix(x) 154.5ms 155ms 154.8ms 154.9ms 6.46 160.3MB 0 4
## ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
## time <list>, gc <list>
Note how dist2mat
is faster (see "mean", "median"), and also how less RAM (see "mem_alloc") it needs. I've set check = FALSE
to disable result checking, because dist2mat
returns no "dimnames" attribute (the "dist" object has no such info) but as.matrix
does (it sets 1:2000
as "dimnames"), so they are not exactly equal. But you can verify that they are both correct.
A <- dist2mat(x, 128)
B <- as.matrix(x)
range(A - B)
#[1] 0 0
"dist2mat.cpp"
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix dist2mat(NumericVector& x, int bf) {
/* input validation */
if (!x.inherits("dist")) stop("Input must be a 'dist' object");
int n = x.attr("Size");
if (n > 65536) stop("R cannot create a square matrix larger than 65536 x 65536");
/* initialization */
NumericMatrix A(n, n);
/* use pointers */
size_t j, i, jj, ni, nj; double *ptr_x = &x[0];
double *A_jj, *A_ij, *A_ji, *col, *row, *end;
/* fill in lower triangular part */
for (j = 0; j < n; j++) {
col = &A(j + 1, j); end = &A(n, j);
while (col < end) *col++ = *ptr_x++;
}
/* cache blocking factor */
size_t b = (size_t)bf;
/* copy lower triangular to upper triangular; cache blocking applied */
for (j = 0; j < n; j += b) {
nj = n - j; if (nj > b) nj = b;
/* diagonal block has size nj x nj */
A_jj = &A(j, j);
for (jj = nj - 1; jj > 0; jj--, A_jj += n + 1) {
/* copy a column segment to a row segment */
col = A_jj + 1; row = A_jj + n;
for (end = col + jj; col < end; col++, row += n) *row = *col;
}
/* off-diagonal blocks */
for (i = j + nj; i < n; i += b) {
ni = n - i; if (ni > b) ni = b;
/* off-diagonal block has size ni x nj */
A_ij = &A(i, j); A_ji = &A(j, i);
for (jj = 0; jj < nj; jj++) {
/* copy a column segment to a row segment */
col = A_ij + jj * n; row = A_ji + jj;
for (end = col + ni; col < end; col++, row += n) *row = *col;
}
}
}
/* add row names and column names */
A.attr("dimnames") = List::create(x.attr("Labels"), x.attr("Labels"));
return A;
}
来源:https://stackoverflow.com/questions/51707582/as-matrix-on-a-distance-object-is-extremely-slow-how-to-make-it-faster