R: fast sliding window with given coordinates

前端 未结 2 617
耶瑟儿~
耶瑟儿~ 2021-02-08 19:09

I have a data table with nrow being around a million or two and ncol of about 200.

Each entry in a row has a coordinate associated with it.

Tiny portion of the d

相关标签:
2条回答
  • 2021-02-08 19:27

    Data generation:

    N <- 1e5 # rows
    M <- 200 # columns
    W <- 10  # window size
    
    set.seed(1)
    intensities <- matrix(rnorm(N*M), nrow=N, ncol=M)
    coords <- 8000000 + sort(sample(1:(5*N), N))
    

    Original function with minor modifications I used for benchmarks:

    doSlidingWindow <- function(intensities, coords, windsize) {
      windHalfSize <- ceiling(windsize/2)
      ### whole range inds
      RANGE <- integer(max(coords)+windsize)
      RANGE[coords] <- c(1:length(coords)[1])
    
      ### get indices of rows falling in each window
      ### NOTE: Each elements of WINDOWINDS holds zero. Not a big problem though.
      WINDOWINDS <- sapply(coords, function(crds) ret <- unique(RANGE[(crds-windHalfSize):(crds+windHalfSize)]))
    
      ### do windowing
      wind_ints <- intensities
      wind_ints[] <- 0
      for(i in 1:length(coords)) {
        # CORRECTION: When it's only one row in window there was a trouble
        wind_ints[i,] <- apply(matrix(intensities[WINDOWINDS[[i]],], ncol=ncol(intensities)), 2, mean)
      }
      return(wind_ints)
    }
    

    POSSIBLE SOLUTIONS:


    1) data.table

    data.table is known to be fast with subsetting, but this page (and other related to sliding window) suggests, that this is not the case. Indeed, data.table code is elegant, but unfortunately very slow:

    require(data.table)
    require(plyr)
    dt <- data.table(coords, intensities)
    setkey(dt, coords)
    aaply(1:N, 1, function(i) dt[WINDOWINDS[[i]], sapply(.SD,mean), .SDcols=2:(M+1)])
    

    2) foreach+doSNOW

    Basic routine is easy to run in parallel, so, we can benefit from it:

    require(doSNOW)
    doSlidingWindow2 <- function(intensities, coords, windsize) {
      NC <- 2 # number of nodes in cluster
      cl <- makeCluster(rep("localhost", NC), type="SOCK")
      registerDoSNOW(cl)
    
      N <- ncol(intensities) # total number of columns
      chunk <- ceiling(N/NC) # number of columns send to the single node
    
      result <- foreach(i=1:NC, .combine=cbind, .export=c("doSlidingWindow")) %dopar% {
        start <- (i-1)*chunk+1
        end   <- ifelse(i!=NC, i*chunk, N)
        doSlidingWindow(intensities[,start:end], coords, windsize)    
      }
    
      stopCluster(cl)
      return (result)
    }
    

    Benchmark shows notable speed-up on my Dual-Core processor:

    system.time(res <- doSlidingWindow(intensities, coords, W))
    #    user  system elapsed 
    # 306.259   0.204 307.770
    system.time(res2 <- doSlidingWindow2(intensities, coords, W))
    #  user  system elapsed 
    # 1.377   1.364 177.223
    all.equal(res, res2, check.attributes=FALSE)
    # [1] TRUE
    

    3) Rcpp

    Yes, I know you asked "without going to C". But, please, take a look. This code is inline and rather straightforward:

    require(Rcpp)
    require(inline)
    doSlidingWindow3 <- cxxfunction(signature(intens="matrix", crds="numeric", wsize="numeric"), plugin="Rcpp", body='
      #include <vector>
      Rcpp::NumericMatrix intensities(intens);
      const int N = intensities.nrow();
      const int M = intensities.ncol();
      Rcpp::NumericMatrix wind_ints(N, M);
    
      std::vector<int> coords = as< std::vector<int> >(crds);
      int windsize = ceil(as<double>(wsize)/2);  
    
      for(int i=0; i<N; i++){
        // Simple search for window range (begin:end in coords)
        // Assumed that coords are non-decreasing
        int begin = (i-windsize)<0?0:(i-windsize);
        while(coords[begin]<(coords[i]-windsize)) ++begin;
        int end = (i+windsize)>(N-1)?(N-1):(i+windsize);
        while(coords[end]>(coords[i]+windsize)) --end;
    
        for(int j=0; j<M; j++){
          double result = 0.0;
          for(int k=begin; k<=end; k++){
            result += intensities(k,j);
          }
          wind_ints(i,j) = result/(end-begin+1);
        }
      }
    
      return wind_ints;
    ')
    

    Benchmark:

    system.time(res <- doSlidingWindow(intensities, coords, W))
    #    user  system elapsed 
    # 306.259   0.204 307.770
    system.time(res3 <- doSlidingWindow3(intensities, coords, W))
    #  user  system elapsed 
    # 0.328   0.020   0.351
    all.equal(res, res3, check.attributes=FALSE)
    # [1] TRUE
    

    I hope results are quite motivating. While data fits in memory Rcpp version is pretty fast. Say, with N <- 1e6 and M <-100 I got:

       user  system elapsed 
      2.873   0.076   2.951
    

    Naturally, after R starts using swap everything slows down. With really large data that doesn't fit in memory you should consider sqldf, ff or bigmemory.

    0 讨论(0)
  • 2021-02-08 19:52

    Rollapply works great with a small dataset. However, if you are working with several million rows (genomics) it is quite slow.

    The following function is super fast:

    data <- c(runif(100000, min=0, max=.1),runif(100000, min=.05, max=.1),runif(10000, min=.05, max=1), runif(100000, min=0, max=.2))
    slideFunct <- function(data, window, step){
      total <- length(data)
      spots <- seq(from=1, to=(total-window), by=step)
      result <- vector(length = length(spots))
      for(i in 1:length(spots)){
        result[i] <- mean(data[spots[i]:(spots[i]+window)])
      }
      return(result)
    }
    

    Details here.

    0 讨论(0)
提交回复
热议问题