Fastest way to cross-tabulate two massive logical vectors in R

别那么骄傲 2021-02-02 16:08

For two logical vectors, x and y, of length > 1E8, what is the fastest way to calculate the 2x2 cross tabulations?

I suspect the answer is to w

5 answers
  •  醉梦人生
    2021-02-02 16:19

    A different tactic is to consider just set intersections, using the indices of the TRUE values, taking advantage that the samples are very biased (i.e. mostly FALSE).

    To that end, I introduce func_find01 and a translation that uses the bit package (func_find01B); all of the code that doesn't appear in the answer above is pasted below.

    I re-ran the full N=3e8 evaluation but forgot to include func_find01B, so I re-ran the faster methods against it in a second pass.

              test replications elapsed   relative user.self sys.self
    6   logical3B1            1   1.298   1.000000      1.13     0.17
    4    logicalB1            1   1.805   1.390601      1.57     0.23
    7   logical3B2            1   2.317   1.785054      2.12     0.20
    5    logicalB2            1   2.820   2.172573      2.53     0.29
    2       find01            1   6.125   4.718798      4.24     1.88
    9 bigtabulate2            1  22.823  17.583205     21.00     1.81
    3      logical            1  23.800  18.335901     15.51     8.28
    8  bigtabulate            1  27.674  21.320493     24.27     3.40
    1        table            1 183.467 141.345917    149.01    34.41
    

    Just the "fast" methods:

            test replications elapsed relative user.self sys.self
    3     find02            1   1.078 1.000000      1.03     0.04
    6 logical3B1            1   1.312 1.217069      1.18     0.13
    4  logicalB1            1   1.797 1.666976      1.58     0.22
    2    find01B            1   2.104 1.951763      2.03     0.08
    7 logical3B2            1   2.319 2.151206      2.13     0.19
    5  logicalB2            1   2.817 2.613173      2.50     0.31
    1     find01            1   6.143 5.698516      4.21     1.93
    

    So find01B is the fastest of the methods that do not use pre-converted bit vectors, by a slim margin (2.104 seconds versus 2.319 seconds for logical3B2). Where did find02 come from? I subsequently wrote a version that takes pre-computed bit vectors, and it is now the fastest overall.
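
    A minimal sketch of the pre-conversion step, assuming the bit package is installed. The names xb and yb follow the benchmark call; the small input vectors are made up for illustration.

    ```r
    library(bit)  # assumed installed; provides as.bit() and bit-vector methods for & and sum()

    # Small made-up inputs standing in for the 3e8-element vectors.
    x <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
    y <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

    # Pay the logical -> bit conversion cost once, outside any timed code.
    xb <- as.bit(x)
    yb <- as.bit(y)

    n11 <- sum(xb & yb)  # both TRUE
    n1  <- sum(xb)       # TRUE in x
    n2  <- sum(yb)       # TRUE in y

    # Same cell order as func_find02: both TRUE, x only, y only, neither.
    counts <- c(n11, n1 - n11, n2 - n11, length(xb) - n1 - n2 + n11)
    ```

    Whether the conversion cost should count against the method depends on whether the data can be kept as bit vectors from the start.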

    In general, the running time of the "indices method" depends on the marginal and joint probabilities of TRUE. I suspect it would be especially competitive when those probabilities are even lower, but one has to know that a priori or estimate it from a sub-sample.
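
    One way to check the sparsity assumption is to estimate the marginal TRUE rates from a random sub-sample. A rough sketch: the helper name estimate_true_rate, the sample size, and the simulated data are illustrative assumptions, not part of the benchmark.

    ```r
    # Hypothetical helper: estimate the fraction of TRUE values from a random sub-sample.
    estimate_true_rate <- function(v, n_sample = 1e5) {
        idx <- sample.int(length(v), min(n_sample, length(v)))
        mean(v[idx])
    }

    set.seed(1)
    x <- runif(1e6) < 0.02  # simulated sparse vector, about 2% TRUE
    y <- runif(1e6) < 0.05  # simulated sparse vector, about 5% TRUE

    p_x <- estimate_true_rate(x)
    p_y <- estimate_true_rate(y)
    ```

    Low estimates suggest the index-based approach is worth trying; the point at which it stops paying off is something to measure on the actual data.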


    Update 1. I've also timed Josh O'Brien's suggestion of using tabulate() instead of table(). At about 12.8 seconds elapsed, it takes roughly twice as long as find01 and about half as long as bigtabulate2. Now that the best methods are approaching 1 second, that is relatively slow:

     user  system elapsed 
    7.670   5.140  12.815 
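
    The tabulate() approach encodes each (x, y) pair as an integer from 1 to 4 and counts the codes. A small illustration on made-up vectors; nbins = 4L is an addition here so that empty trailing cells still appear as zeros.

    ```r
    x <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
    y <- c(TRUE, TRUE, FALSE, FALSE, FALSE)

    # Code 1: both FALSE, 2: x only, 3: y only, 4: both TRUE.
    codes  <- 1L + 1L * x + 2L * y
    counts <- tabulate(codes, nbins = 4L)
    ```

    Note that the first and last cells are swapped relative to func_find01, which returns the both-TRUE count first.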
    

    Code:

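    library(bit)          # as.bit() and bit-vector operators
    library(bigtabulate)  # bigtabulate()
    library(rbenchmark)   # benchmark()
    
    # Count via the indices of the TRUE values.
    # Returns c(both TRUE, v1 only, v2 only, neither).
    func_find01 <- function(v1, v2){
        ix1 <- which(v1)
        ix2 <- which(v2)
    
        len_ixJ <- sum(ix1 %in% ix2)
        len1    <- length(ix1)
        len2    <- length(ix2)
        return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
                 length(v1) - len1 - len2 + len_ixJ))
    }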
    
    # Same counts, but converts the logical vectors to bit vectors first.
    func_find01B <- function(v1, v2){
        v1b <- as.bit(v1)
        v2b <- as.bit(v2)
    
        len_ixJ <- sum(v1b & v2b)
        len1 <- sum(v1b)
        len2 <- sum(v2b)
    
        return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
                 length(v1) - len1 - len2 + len_ixJ))
    }
    
    func_find02 <- function(v1b, v2b){
        len_ixJ <- sum(v1b & v2b)
        len1 <- sum(v1b)
        len2 <- sum(v2b)
    
        return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
                 length(v1b) - len1 - len2 + len_ixJ))
    }
    
    func_bigtabulate2    <- function(v1,v2){
        return(bigtabulate(cbind(v1,v2), ccols = c(1,2)))
    }
    
    func_tabulate01 <- function(v1,v2){
        # Use the function arguments, not the globals x and y.
        return(tabulate(1L + 1L*v1 + 2L*v2))
    }
    
    benchmark(replications = 1, order = "elapsed", 
        table = {res <- func_table(x,y)},
        find01  = {res <- func_find01(x,y)},
        find01B  = {res <- func_find01B(x,y)},
        find02  = {res <- func_find02(xb,yb)},
        logical = {res <- func_logical(x,y)},
        logicalB1 = {res <- func_logical(xb,yb)},
        logicalB2 = {res <- func_logicalB(x,y)},
    
        logical3B1 = {res <- func_logical3(xb,yb)},
        logical3B2 = {res <- func_logical3B(x,y)},
    
        tabulate    = {res <- func_tabulate01(x,y)},
        bigtabulate = {res <- func_bigtabulate(x,y)},
        bigtabulate2 = {res <- func_bigtabulate2(x1,y1)}
    )
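
    Before trusting the timings, it is worth confirming on a small input that the methods return the same counts. A minimal check on simulated data, using a local copy of the index-based logic from func_find01 and comparing it against table():

    ```r
    set.seed(42)
    x <- runif(1000) < 0.1
    y <- runif(1000) < 0.1

    # Index-based counts, same logic as func_find01.
    ix1 <- which(x)
    ix2 <- which(y)
    n11 <- sum(ix1 %in% ix2)
    fast <- c(n11, length(ix1) - n11, length(ix2) - n11,
              length(x) - length(ix1) - length(ix2) + n11)

    # Reference counts from table(); explicit levels keep it 2x2 even if a cell is empty.
    tab <- table(factor(x, levels = c(TRUE, FALSE)),
                 factor(y, levels = c(TRUE, FALSE)))
    # Rows index x, columns index y: both TRUE, x only, y only, neither.
    ref <- c(tab[1, 1], tab[1, 2], tab[2, 1], tab[2, 2])

    stopifnot(all(fast == ref))
    ```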
    
