Question
I have to work with big.matrix objects and I can’t compute some functions. Let's consider the following big.matrix:
# create big.matrix object
x <- as.big.matrix(
matrix( sample(1:10, 20, replace=TRUE), 5, 4,
dimnames=list( NULL, c("a", "b", "c", "d")) ) )
> x
An object of class "big.matrix"
Slot "address":
<pointer: 0x00000000141beee0>
The corresponding matrix object is:
# create matrix object
x2<-x[,]
> x2
a b c d
[1,] 6 9 5 3
[2,] 3 6 10 8
[3,] 7 1 2 8
[4,] 7 8 4 10
[5,] 6 3 6 4
If I compute these operations with the matrix object, it works:
sqrt(slam::col_sums(x2*x2))
> sqrt(slam::col_sums(x2*x2))
a b c d
13.37909 13.82027 13.45362 15.90597
But with the big.matrix object (which is what I actually have to use), it doesn't work:
sqrt(biganalytics::colsum(x*x))
There are two problems. The * operation (used to square each element of the matrix) produces the error:
Error in x * x : non-numeric argument to binary operator
and the sqrt function produces the error:
Error in sqrt(x) : non-numeric argument to mathematical function
How can I compute these operations with big.matrix objects?
Answer 1:
With big.matrix objects, I found 2 solutions that offer good performance:
- code a function in Rcpp for what you specifically need. Here, 2 nested for loops would do the trick. Yet, you can't recode everything you need.
- use an R function on column blocks of your big.matrix and aggregate the results. It is easy to do and uses R code only.
In your case, with 10,000 times more columns:
require(bigmemory)
# 5 x 40000 matrix, so draw 5 * 40000 values rather than recycling 20000
x <- as.big.matrix(
  matrix( sample(1:10, 5 * 40000, replace=TRUE), 5, 40000,
          dimnames=list( NULL, rep(c("a", "b", "c", "d"), 10000) ) ) )
print(system.time(
true <- sqrt(colSums(x[,]^2))
))
print(system.time(
test1 <- biganalytics::apply(x, 2, function(x) {sqrt(sum(x^2))})
))
print(all.equal(test1, true))
So, colSums is very fast but needs the whole matrix in RAM, whereas biganalytics::apply is slow, but memory-efficient. A compromise would be to use something like this:
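# Split 1..m into nb contiguous blocks of roughly equal size
# (about block.size elements each); returns one row per block.
CutBySize <- function(m, block.size, nb = ceiling(m / block.size)) {
  int <- m / nb
  upper <- round(1:nb * int)
  lower <- c(1, upper[-nb] + 1)
  size <- c(upper[1], diff(upper))
  cbind(lower, upper, size)
}
# Column indices covered by one row of the CutBySize() result
seq2 <- function(lims) seq(lims["lower"], lims["upper"])
require(foreach)
# Apply FUN to each block of columns of X and combine the results
big_aggregate <- function(X, FUN, .combine, block.size = 1e3) {
  intervals <- CutBySize(ncol(X), block.size)
  foreach(k = 1:nrow(intervals), .combine = .combine) %do% {
    # only this block of columns is loaded into RAM
    FUN(X[, seq2(intervals[k, ])])
  }
}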
print(system.time(
test2 <- big_aggregate(x, function(X) sqrt(colSums(X^2)), .combine = 'c')
))
print(all.equal(test2, true))
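To see what the block splitting does, here is a small base-R sketch. It repeats the CutBySize helper from above so that it runs on its own, and it uses an ordinary matrix `m` as a stand-in for the big.matrix (an assumption made purely for illustration; the names `m`, `blockwise` and `direct` are not from the original code).

```r
# Copy of the CutBySize helper from above, repeated so this snippet is self-contained.
CutBySize <- function(m, block.size, nb = ceiling(m / block.size)) {
  int <- m / nb
  upper <- round(1:nb * int)
  lower <- c(1, upper[-nb] + 1)
  size <- c(upper[1], diff(upper))
  cbind(lower, upper, size)
}

# Split 10 columns into blocks of at most 3 columns each.
intervals <- CutBySize(10, block.size = 3)
print(intervals)

# Ordinary matrix standing in for the big.matrix (illustration only).
m <- matrix(as.numeric(1:50), nrow = 5, ncol = 10)

# Column norms computed block by block, then concatenated.
blockwise <- unlist(lapply(seq_len(nrow(intervals)), function(k) {
  cols <- intervals[k, "lower"]:intervals[k, "upper"]
  sqrt(colSums(m[, cols, drop = FALSE]^2))
}))

# Same result computed on the whole matrix at once.
direct <- sqrt(colSums(m^2))
print(all.equal(blockwise, direct))
```

The blocks cover every column exactly once, so concatenating the per-block results gives the same vector as the all-at-once computation while only one block is held in memory at a time.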
Edit: This is now implemented in package bigstatsr:
print(system.time(
test2 <- bigstatsr::big_apply(x, a.FUN = function(X, ind) {
sqrt(colSums(X[, ind]^2))
}, a.combine = 'c')
))
print(all.equal(test2, true))
Answer 2:
I don't know if it's the fastest way to do it, but try:
biganalytics::apply(x, 2, function(x) {sqrt(sum(x^2))})
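The same column-at-a-time idea can also be sketched with base R's vapply, which only relies on `ncol` and `[`. The helper name `col_norms` is hypothetical, and the example runs on an ordinary matrix; it should behave the same on a big.matrix since both operations are used on one in the first answer, but that is not tested here.

```r
# Hypothetical helper: Euclidean norm of each column, reading one column at a time.
col_norms <- function(X) {
  vapply(seq_len(ncol(X)), function(j) sqrt(sum(X[, j]^2)), numeric(1))
}

# Small ordinary matrix with columns (3, 4) and (6, 8).
m <- matrix(c(3, 4, 6, 8), nrow = 2)
print(col_norms(m))  # 5 10
```

Like biganalytics::apply, this keeps only one column in memory at a time, so it trades speed for a small memory footprint.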
Source: https://stackoverflow.com/questions/42111876/operating-with-big-matrix