For two logical vectors, x and y, of length > 1E8, what is the fastest way to calculate the 2x2 cross tabulations?

I suspect the answer is to write it in C/C++.
A different tactic is to consider just set intersections, using the indices of the TRUE values and taking advantage of the fact that the samples are very biased (i.e. mostly FALSE).

To that end, I introduce func_find01 and a translation that uses the bit package (func_find01B); all of the code that doesn't appear in the answer above is pasted below.

I re-ran the full N=3e8 evaluation, but forgot to include func_find01B; I re-ran the faster methods against it in a second pass.
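(For reference: the timings below assume objects along the following lines. The generation code isn't part of this answer, so the probabilities here are illustrative only.)

# Illustrative setup, not the original generation code: heavily biased
# logical vectors plus the derived objects used by some of the methods.
library(bit)
N  <- 3e8
x  <- runif(N) < 0.02   # mostly FALSE
y  <- runif(N) < 0.02
xb <- as.bit(x)         # pre-converted bit vectors (logicalB1, find02, ...)
yb <- as.bit(y)
x1 <- 1 * x             # numeric copies for bigtabulate2
y1 <- 1 * y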
          test replications elapsed   relative user.self sys.self
6   logical3B1            1   1.298   1.000000      1.13     0.17
4    logicalB1            1   1.805   1.390601      1.57     0.23
7   logical3B2            1   2.317   1.785054      2.12     0.20
5    logicalB2            1   2.820   2.172573      2.53     0.29
2       find01            1   6.125   4.718798      4.24     1.88
9 bigtabulate2            1  22.823  17.583205     21.00     1.81
3      logical            1  23.800  18.335901     15.51     8.28
8  bigtabulate            1  27.674  21.320493     24.27     3.40
1        table            1 183.467 141.345917    149.01    34.41
Just the "fast" methods:
        test replications elapsed relative user.self sys.self
3     find02            1   1.078 1.000000      1.03     0.04
6 logical3B1            1   1.312 1.217069      1.18     0.13
4  logicalB1            1   1.797 1.666976      1.58     0.22
2    find01B            1   2.104 1.951763      2.03     0.08
7 logical3B2            1   2.319 2.151206      2.13     0.19
5  logicalB2            1   2.817 2.613173      2.50     0.31
1     find01            1   6.143 5.698516      4.21     1.93
So, find01B is the fastest among the methods that do not use pre-converted bit vectors, by a slim margin (2.104 seconds versus 2.319 seconds for logical3B2). Where did find02 come from? I subsequently wrote a version that uses pre-computed bit vectors; it is now the fastest.
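Comparing the two timings, roughly half of find01B's time goes into the as.bit() conversion, so converting once and reusing the bit vectors pays off whenever more than one tabulation is needed; a usage sketch:

# Pay the conversion cost once, then reuse the bit vectors.
xb <- as.bit(x)
yb <- as.bit(y)
res <- func_find02(xb, yb)  # repeated calls skip the conversion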
In general, the running time of the "indices method" approach may be affected by the marginal & joint probabilities. I suspect that it would be especially competitive when the probabilities are even lower, but one has to know that a priori, or via a sub-sample.
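For instance, the marginals can be estimated cheaply from a sub-sample (a sketch; the sample size is arbitrary):

# Sketch: estimate sparsity from a random sub-sample before choosing a method.
idx <- sample.int(length(x), 1e5)
p1  <- mean(x[idx])           # estimated P(x == TRUE)
p2  <- mean(y[idx])           # estimated P(y == TRUE)
p12 <- mean(x[idx] & y[idx])  # estimated joint P(x & y)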
Update 1. I've also timed Josh O'Brien's suggestion, using tabulate() instead of table(). The result, at about 12.8 seconds elapsed, is roughly 2X find01 and about half the time of bigtabulate2. Now that the best methods are approaching 1 second, this is also relatively slow:
   user  system elapsed
  7.670   5.140  12.815
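That timing can be produced with something like the following (the exact call I used isn't shown above, so this is an assumption):

system.time({res <- func_tabulate01(x, y)})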
Code:
# Packages used below: bit (for as.bit() and fast ops on bit vectors),
# bigtabulate (for bigtabulate()), and rbenchmark (for benchmark()).
library(bit)
library(bigtabulate)
library(rbenchmark)

func_find01 <- function(v1, v2){
    # Work only with the indices of the TRUE values.
    ix1 <- which(v1 == TRUE)
    ix2 <- which(v2 == TRUE)
    len_ixJ <- sum(ix1 %in% ix2)   # count of positions TRUE in both
    len1 <- length(ix1)
    len2 <- length(ix2)
    # Cells in order (TT, TF, FT, FF); FF follows by inclusion-exclusion.
    return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
             length(v1) - len1 - len2 + len_ixJ))
}
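# Quick sanity check on tiny inputs (my addition, not in the original answer):
func_find01(c(TRUE, FALSE, TRUE, FALSE), c(TRUE, TRUE, FALSE, FALSE))
# [1] 1 1 1 1   # one case in each of the four cells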
func_find01B <- function(v1, v2){
    # Same logic, but via the bit package: '&' and sum() on bit
    # vectors run at C speed over packed bits.
    v1b <- as.bit(v1)
    v2b <- as.bit(v2)
    len_ixJ <- sum(v1b & v2b)
    len1 <- sum(v1b)
    len2 <- sum(v2b)
    return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
             length(v1) - len1 - len2 + len_ixJ))
}
func_find02 <- function(v1b, v2b){
    # As func_find01B, but expects pre-converted bit vectors, so the
    # as.bit() conversion cost is paid once, outside the function.
    len_ixJ <- sum(v1b & v2b)
    len1 <- sum(v1b)
    len2 <- sum(v2b)
    return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
             length(v1b) - len1 - len2 + len_ixJ))
}
func_bigtabulate2 <- function(v1, v2){
    return(bigtabulate(cbind(v1, v2), ccols = c(1, 2)))
}

func_tabulate01 <- function(v1, v2){
    # Josh O'Brien's suggestion: map each pair to a cell id in 1..4,
    # then count the cell ids with tabulate().
    return(tabulate(1L + 1L*v1 + 2L*v2))
}
# func_table, func_logical, func_logicalB, func_logical3, func_logical3B,
# func_tabulate and func_bigtabulate are defined in the answer above;
# x1 and y1 are numeric copies of x and y (see the illustrative setup above).
benchmark(replications = 1, order = "elapsed",
    table        = {res <- func_table(x, y)},
    find01       = {res <- func_find01(x, y)},
    find01B      = {res <- func_find01B(x, y)},
    find02       = {res <- func_find02(xb, yb)},
    logical      = {res <- func_logical(x, y)},
    logicalB1    = {res <- func_logical(xb, yb)},
    logicalB2    = {res <- func_logicalB(x, y)},
    logical3B1   = {res <- func_logical3(xb, yb)},
    logical3B2   = {res <- func_logical3B(x, y)},
    tabulate     = {res <- func_tabulate(x, y)},
    bigtabulate  = {res <- func_bigtabulate(x, y)},
    bigtabulate2 = {res <- func_bigtabulate2(x1, y1)}
)
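The find functions return the four counts in the order (TRUE,TRUE), (TRUE,FALSE), (FALSE,TRUE), (FALSE,FALSE). If a labeled table is wanted, a small helper along these lines works (my own addition, hypothetical naming):

# Hypothetical helper: label the length-4 count vector as a 2x2 table.
as_2x2 <- function(counts) {
    matrix(counts, nrow = 2, byrow = TRUE,
           dimnames = list(v1 = c("TRUE", "FALSE"),
                           v2 = c("TRUE", "FALSE")))
}
as_2x2(func_find02(xb, yb))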