Efficient (memory-wise) function for repeated distance matrix calculations AND chunking of extra large distance matrices

感情迁移 提交于 2019-11-28 11:43:16

I've come up with a chunking solution for those extra large matrices that dist() can't handle, which I'm posting here in case anyone else finds it helpful (or finds fault with it, please!). It is significantly slower than dist(), but that is kind of irrelevant, since it should only ever be used when dist() throws an error - usually one of the following:

"Error in double(N * (N - 1)/2) : vector size specified is too large" 
"Error: cannot allocate vector of size 6.0 Gb"
"Error: negative length vectors are not allowed"

The function calculates the mean distance for the matrix, but you can change that to anything else, but in case you want to actually save the matrix I believe some sort of filebacked bigmemory matrix is in order.. Kudos to link for the idea and Ari for his help!

FunDistanceMatrixChunking <- function (df, blockSize=100){
  n <- nrow(df)
  blocks <- n %/% blockSize
  if((n %% blockSize) > 0)blocks <- blocks + 1
  chunk.means <- matrix(NA, nrow=blocks*(blocks+1)/2, ncol= 2)
  dex <- 1:blockSize
  chunk <- 0
  for(i in 1:blocks){    
    p <- dex + (i-1)*blockSize
    lex <- (blockSize+1):(2*blockSize)
    lex <- lex[p<= n]
    p <- p[p<= n]
    for(j in 1:blocks){
      q <- dex +(j-1)*blockSize
      q <- q[q<=n]     
      if (i == j) {       
        chunk <- chunk+1
        x <- dist(df[p,])
        chunk.means[chunk,] <- c(length(x), mean(x))}
      if ( i > j) {
        chunk <- chunk+1
        x <- as.matrix(dist(df[c(q,p),]))[lex,dex] 
        chunk.means[chunk,] <- c(length(x), mean(x))}
    }
  }
  mean <- weighted.mean(chunk.means[,2], chunk.means[,1])
  return(mean)
}
df <- cbind(var1=rnorm(1000), var2=rnorm(1000))
mean(dist(df))
FunDistanceMatrixChunking(df, blockSize=100)

Not sure whether I should have posted this as an edit, instead of an answer.. It does solve my problem, although I didn't really specify it this way..

A few thoughts:

  • unique(df[1]) probably works (by ignoring the data.frame property of your list), but makes me nervous and is hard to read. unique(df[,1]) would be better.
  • for (f in 1:n.factors) {df.l[[f]]<-df[which(df$factor==factor.names[f,]),]} can be done with split.
  • If you're worried about memory, definitely don't store the entire distance matrix for every level, then calculate your summary statistics for every factor level! Change your lapply to something like: lapply (df.l, function(x) mean(dist(x[,2:length(x)], method="minkowski", p=2))).

If you need more than one summary statistic, calculate both and return a list:

lapply (df.l, function(x) {
   dmat <- dist(x[,2:length(x)], method="minkowski", p=2)
   list( mean=mean(dmat), median=median(dmat) )
})

See if that gets you anywhere. If not, you may have to go more specialized (avoiding lapply, storing your data.frames as matrices instead, etc.)

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!