Efficiently perform row-wise distribution test

与世无争的帅哥 提交于 2019-11-30 19:38:35
Khashaa

A quick and dirty implementation in Rcpp

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h> 

double KS(arma::colvec x, arma::colvec y) {
  int n = x.n_rows;
  arma::colvec w = join_cols(x, y);
  arma::uvec z = arma::sort_index(w);
  w.fill(-1); w.elem( find(z <= n-1) ).ones();
  return max(abs(cumsum(w)))/n;
}
// [[Rcpp::export]]
Rcpp::NumericVector K_S(arma::mat mt) {
  int n = mt.n_cols; 
  Rcpp::NumericVector results(n);
  for (int i=1; i<n;i++) {
    arma::colvec x=mt.col(i-1);
    arma::colvec y=mt.col(i);
    results[i] = KS(x, y);
    }
  return results;
}

for matrix of size (400, 30000), it completes under 1s.

system.time(K_S(t(mt)))[3]
#elapsed 
#   0.98 

And the result seems to be accurate.

set.seed(1942)
mt <- matrix(rnorm(400*30000), nrow=30000)
results <- rep(0, nrow(mt))
for (i in 2 : nrow(mt)) {
  results[i] <- ks.test(x = mt[i - 1, ], y = mt[i, ])$statistic
}
result <- K_S(t(mt))
all.equal(result, results)
#[1] TRUE

One source of speed up is to write a smaller version of ks.test that does less. ks.test2 below is more restrictive than ks.test. It assumes, for example, that you have no missing values and that you always want the statistic associated with a two-sided test.

ks.test2 <- function(x, y){

  n.x <- length(x)
  n.y <- length(y)
  w <- c(x, y)
  z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))

  max(abs(z))

}

Verify that the output is consistent with ks.test.

set.seed(999)
x <- rnorm(400)
y <- rnorm(400)

ks.test(x, y)$statistic

    D 
0.045

ks.test2(x, y)

[1] 0.045

Now determine the savings from the smaller function:

library(microbenchmark)

microbenchmark(
  ks.test(x, y),
  ks.test2(x, y)
  )

Unit: microseconds
           expr      min       lq      mean   median        uq      max neval cld
  ks.test(x, y) 1030.238 1070.303 1347.3296 1227.207 1313.8490 6338.918   100   b
 ks.test2(x, y)  709.719  730.048  832.9532  833.861  888.5305 1281.284   100  a 

I was able to compute the pairwise Kruskal-Wallis statistic using ks.test() with rollapplyr().

results <- rollapplyr(data = big,
                      width = 2,
                      FUN = function(x) ks.test(x[1, ], x[2, ])$statistic,
                      by.column = FALSE)

This gets the expected result, but it's slow for a dataset of your size. Slow slow slow. This may be because ks.test() is computing a lot more than just the statistic at each iteration; it also gets the p-value and does a lot of error checking.

Indeed, if we simulate a large dataset like so:

big <- NULL
for (i in 1:400) {
    big <- cbind(big, rnorm(300000))
}

The rollapplyr() solution takes a long time; I halted execution after about 2 hours, at which point it had computed nearly all (but not all) results.

It seems that while rollapplyr() is likely faster than a for loop, it will not likely be the best overall solution in terms of performance.

Here's a dplyr solution that gets the same result as your loop. I have my doubts if this is actually faster than the loop, but perhaps it can serve as a first step towards a solution.

require(dplyr)
mt %>% 
  as.data.frame %>%
  mutate_each(funs(lag)) %>%
  cbind(mt) %>%
  slice(-1) %>%
  rowwise %>%
  do({
    x = unlist(.)
    n <- length(x)
    data.frame(ks = ks.test(head(x, n/2), tail(x, n/2))$statistic)
  }) %>%
  unlist %>%
  c(NA, .) %>%
  matrix
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!