For two logical vectors, x and y, of length > 1E8, what is the fastest way to calculate the 2x2 cross tabulations?
I suspect the answer is to write it in C/C++, but I wonder if there is something in R that is already quite smart about this problem.
If you're doing a lot of operations on huge logical vectors, take a look at the bit package. It saves a ton of memory by storing the booleans as true 1-bit booleans.
This doesn't help with table; it actually makes things worse, because there are more unique values in the bit vector due to how it's constructed. But it really helps with logical comparisons.
# N <- 3e7
require(bit)
require(rbenchmark)  # provides benchmark(); func_logical() is defined in the answers below
xb <- as.bit(x)
yb <- as.bit(y)
benchmark(replications = 1, order = "elapsed",
    bit     = {res <- func_logical(xb, yb)},
    logical = {res <- func_logical(x, y)}
)
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 bit 1 0.129 1.00000 0.132 0.000 0 0
# 2 logical 1 3.677 28.50388 2.684 0.928 0 0
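For a sense of the memory savings alone, here is a quick check with object.size() (a rough sketch using the x and xb above at N = 3e7; exact byte counts vary slightly by platform):
object.size(x)    # ~1.2e8 bytes: logicals are stored as 4-byte integers
object.size(xb)   # ~3.75e6 bytes: the bit version uses roughly 1/32 as much memory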
Here is an answer using Rcpp sugar.
N <- 1e8
x <- sample(c(T,F),N,replace=T)
y <- sample(c(T,F),N,replace=T)
func_logical <- function(v1,v2){
return(c(sum(v1 & v2), sum(v1 & !v2), sum(!v1 & v2), sum(!v1 & !v2)))
}
library(Rcpp)
library(inline)
doCrossTab1 <- cxxfunction(signature(x="integer", y = "integer"), body='
Rcpp::LogicalVector Vx(x);
Rcpp::LogicalVector Vy(y);
Rcpp::IntegerVector V(4);
V[0] = sum(Vx*Vy);
V[1] = sum(Vx*!Vy);
V[2] = sum(!Vx*Vy);
V[3] = sum(!Vx*!Vy);
return( wrap(V));
'
, plugin="Rcpp")
system.time(doCrossTab1(x,y))
require(bit)
system.time(
{
xb <- as.bit(x)
yb <- as.bit(y)
func_logical(xb,yb)
})
which results in:
> system.time(doCrossTab1(x,y))
user system elapsed
1.067 0.002 1.069
> system.time(
+ {
+ xb <- as.bit(x)
+ yb <- as.bit(y)
+ func_logical(xb,yb)
+ })
user system elapsed
1.451 0.001 1.453
So, we can get a little speed-up over the bit package, though I'm surprised at how competitive the times are.
Update: In honor of Iterator, here is an Rcpp iterator solution:
doCrossTab2 <- cxxfunction(signature(x="integer", y = "integer"), body='
Rcpp::LogicalVector Vx(x);
Rcpp::LogicalVector Vy(y);
Rcpp::IntegerVector V(4);
V[0]=V[1]=V[2]=V[3]=0;
LogicalVector::iterator itx = Vx.begin();
LogicalVector::iterator ity = Vy.begin();
while(itx!=Vx.end()){
V[0] += (*itx)*(*ity);
V[1] += (*itx)*(!*ity);
V[2] += (!*itx)*(*ity);
V[3] += (!*itx)*(!*ity);
itx++;
ity++;
}
return( wrap(V));
'
, plugin="Rcpp")
system.time(doCrossTab2(x,y))
# user system elapsed
# 0.780 0.001 0.782
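If you'd rather avoid the inline package, roughly the same iterator-style loop can be written with Rcpp::cppFunction. This is only an untimed sketch, assuming a reasonably current Rcpp (and, like the code above, no NA handling):
library(Rcpp)
cppFunction('
IntegerVector doCrossTab3(LogicalVector x, LogicalVector y) {
  IntegerVector V(4);
  for (R_xlen_t i = 0; i < x.size(); ++i) {
    // 0 = TT, 1 = TF, 2 = FT, 3 = FF
    V[(x[i] ? 0 : 2) + (y[i] ? 0 : 1)]++;
  }
  return V;
}')
system.time(doCrossTab3(x, y))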
A different tactic is to consider just set intersections, using the indices of the TRUE values and taking advantage of the fact that the samples are very biased (i.e. mostly FALSE).
To that end, I introduce func_find01 and a translation that uses the bit package (func_find01B); all of the code that doesn't appear in the answer above is pasted below.
I re-ran the full N = 3e8 evaluation, except I forgot to use func_find01B; I re-ran the faster methods against it in a second pass.
test replications elapsed relative user.self sys.self
6 logical3B1 1 1.298 1.000000 1.13 0.17
4 logicalB1 1 1.805 1.390601 1.57 0.23
7 logical3B2 1 2.317 1.785054 2.12 0.20
5 logicalB2 1 2.820 2.172573 2.53 0.29
2 find01 1 6.125 4.718798 4.24 1.88
9 bigtabulate2 1 22.823 17.583205 21.00 1.81
3 logical 1 23.800 18.335901 15.51 8.28
8 bigtabulate 1 27.674 21.320493 24.27 3.40
1 table 1 183.467 141.345917 149.01 34.41
Just the "fast" methods:
test replications elapsed relative user.self sys.self
3 find02 1 1.078 1.000000 1.03 0.04
6 logical3B1 1 1.312 1.217069 1.18 0.13
4 logicalB1 1 1.797 1.666976 1.58 0.22
2 find01B 1 2.104 1.951763 2.03 0.08
7 logical3B2 1 2.319 2.151206 2.13 0.19
5 logicalB2 1 2.817 2.613173 2.50 0.31
1 find01 1 6.143 5.698516 4.21 1.93
So, find01B is fastest among the methods that do not use pre-converted bit vectors, by a slim margin (2.099 seconds versus 2.327 seconds). Where did find02 come from? I subsequently wrote a version that uses pre-computed bit vectors; it is now the fastest.
In general, the running time of the "indices method" approach may be affected by the marginal & joint probabilities. I suspect that it would be especially competitive when the probabilities are even lower, but one has to know that a priori, or via a sub-sample.
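For intuition, here is a tiny worked example of the identity the indices method relies on: given sum(v1 & v2), sum(v1), sum(v2), and length(v1), the other three cells follow by subtraction (func_find01 is defined in the Code section below):
x_small <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
y_small <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
func_find01(x_small, y_small)   # 1 2 1 1 = counts of TT, TF, FT, FF
table(x_small, y_small)         # the same four counts, arranged as a 2x2 table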
Update 1. I've also timed Josh O'Brien's suggestion, using tabulate() instead of table(). At 12 seconds elapsed, it takes about twice as long as find01 and about half as long as bigtabulate2. Now that the best methods are approaching 1 second, this is also relatively slow:
user system elapsed
7.670 5.140 12.815
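For reference, the encoding that tabulate() relies on here (see func_tabulate01 in the code below) maps (FALSE,FALSE) to 1, (TRUE,FALSE) to 2, (FALSE,TRUE) to 3, and (TRUE,TRUE) to 4, so the four counts come back in that order. A tiny illustration with the same small vectors as above:
tabulate(1L + 1L*x_small + 2L*y_small)   # 1 2 1 1 = counts of FF, TF, FT, TT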
Code:
func_find01 <- function(v1, v2){
ix1 <- which(v1 == TRUE)
ix2 <- which(v2 == TRUE)
len_ixJ <- sum(ix1 %in% ix2)
len1 <- length(ix1)
len2 <- length(ix2)
return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
length(v1) - len1 - len2 + len_ixJ))
}
func_find01B <- function(v1, v2){
v1b = as.bit(v1)
v2b = as.bit(v2)
len_ixJ <- sum(v1b & v2b)
len1 <- sum(v1b)
len2 <- sum(v2b)
return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
length(v1) - len1 - len2 + len_ixJ))
}
func_find02 <- function(v1b, v2b){
len_ixJ <- sum(v1b & v2b)
len1 <- sum(v1b)
len2 <- sum(v2b)
return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
length(v1b) - len1 - len2 + len_ixJ))
}
func_bigtabulate2 <- function(v1,v2){
return(bigtabulate(cbind(v1,v2), ccols = c(1,2)))
}
func_tabulate01 <- function(v1,v2){
    # Josh O'Brien's suggestion: map each (v1, v2) pair to 1..4 and count with tabulate()
    return(tabulate(1L + 1L*v1 + 2L*v2))
}
benchmark(replications = 1, order = "elapsed",
table = {res <- func_table(x,y)},
find01 = {res <- func_find01(x,y)},
find01B = {res <- func_find01B(x,y)},
find02 = {res <- func_find02(xb,yb)},
logical = {res <- func_logical(x,y)},
logicalB1 = {res <- func_logical(xb,yb)},
logicalB2 = {res <- func_logicalB(x,y)},
logical3B1 = {res <- func_logical3(xb,yb)},
logical3B2 = {res <- func_logical3B(x,y)},
tabulate = {res <- func_tabulate01(x,y)},
bigtabulate = {res <- func_bigtabulate(x,y)},
bigtabulate2 = {res <- func_bigtabulate2(x1,y1)}
)
Here are results for the logical method, table, and bigtabulate, for N = 3E8:
test replications elapsed relative user.self sys.self
2 logical 1 23.861 1.000000 15.36 8.50
3 bigtabulate 1 36.477 1.528729 28.04 8.43
1 table 1 184.652 7.738653 150.61 33.99
In this case, table is a disaster.
For comparison, here is N = 3E6:
test replications elapsed relative user.self sys.self
2 logical 1 0.220 1.000000 0.14 0.08
3 bigtabulate 1 0.534 2.427273 0.45 0.08
1 table 1 1.956 8.890909 1.87 0.09
At this point, it seems that writing one's own logical functions is best, even though that abuses sum and examines each logical vector multiple times. I've not yet tried compiling the functions, but that should yield better results.
Update 1. If we give bigtabulate values that are already integers, i.e. if we do the type conversion 1 * cbind(v1,v2) outside of bigtabulate, then the N = 3E6 multiple is 1.80, instead of 2.4. The N = 3E8 multiple relative to the "logical" method is only 1.21, instead of 1.53.
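For concreteness, the pre-conversion referred to above looks like this (a sketch, done once outside the timed call; note the 1L multiplier, per the remark further down):
x1 <- 1L * x
y1 <- 1L * y
bigtabulate(cbind(x1, y1), ccols = c(1, 2))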
Update 2
As Joshua Ulrich has pointed out, converting to bit vectors is a significant improvement - we're allocating and moving around a LOT less data: R's logical vectors consume 4 bytes per entry ("Why?", you may ask... Well, I don't know, but an answer may turn up here.), whereas a bit vector consumes, well, one bit per entry - i.e. 1/32 as much data. So, x consumes 1.2e9 bytes, while xb (the bit version in the code below) consumes only 3.75e7 bytes.
I've dropped table and the bigtabulate variations from the updated benchmarks (N = 3e8). Note that logicalB1 assumes that the data is already a bit vector, while logicalB2 is the same operation with the penalty for type conversion. As my logical vectors are the results of operations on other data, I don't have the benefit of starting off with a bit vector. Nonetheless, the penalty to be paid is relatively small. [The "logical3" series only performs 3 logical operations and then does a subtraction; since it's a cross-tabulation, we know the total, as DWin has remarked.]
test replications elapsed relative user.self sys.self
4 logical3B1 1 1.276 1.000000 1.11 0.17
2 logicalB1 1 1.768 1.385580 1.56 0.21
5 logical3B2 1 2.297 1.800157 2.15 0.14
3 logicalB2 1 2.782 2.180251 2.53 0.26
1 logical 1 22.953 17.988245 15.14 7.82
We've now sped this up to taking only 1.8-2.8 seconds, even with many gross inefficiencies. There is no doubt it should be feasible to do this in well under 1 second, with changes including one or more of: C code, compilation, and multicore processing. After all, the 3 (or 4) different logical operations could be done independently, even though that's still a waste of compute cycles.
The most similar of the best challengers, logical3B2, is about 80X faster than table. It's about 10X faster than the naive logical operation. And it still has a lot of room for improvement.
Here is the code to produce the benchmark results above. NOTE: I recommend commenting out some of the operations or vectors unless you have a lot of RAM - the creation of x, x1, and xb, along with the corresponding y objects, will take up a fair bit of memory.
Also, note: I should have used 1L as the integer multiplier for bigtabulate, instead of just 1. At some point I will re-run with this change, and would recommend it to anyone who uses the bigtabulate approach.
library(rbenchmark)
library(bigtabulate)
library(bit)
set.seed(0)
N <- 3E8
p <- 0.02
x <- sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)
y <- sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)
x1 <- 1*x
y1 <- 1*y
xb <- as.bit(x)
yb <- as.bit(y)
func_table <- function(v1,v2){
return(table(v1,v2))
}
func_logical <- function(v1,v2){
return(c(sum(v1 & v2), sum(v1 & !v2), sum(!v1 & v2), sum(!v1 & !v2)))
}
func_logicalB <- function(v1,v2){
v1B <- as.bit(v1)
v2B <- as.bit(v2)
return(c(sum(v1B & v2B), sum(v1B & !v2B), sum(!v1B & v2B), sum(!v1B & !v2B)))
}
func_bigtabulate <- function(v1,v2){
return(bigtabulate(1*cbind(v1,v2), ccols = c(1,2)))
}
func_bigtabulate2 <- function(v1,v2){
return(bigtabulate(cbind(v1,v2), ccols = c(1,2)))
}
func_logical3 <- function(v1,v2){
r1 <- sum(v1 & v2)
r2 <- sum(v1 & !v2)
r3 <- sum(!v1 & v2)
r4 <- length(v1) - sum(c(r1, r2, r3))
return(c(r1, r2, r3, r4))
}
func_logical3B <- function(v1,v2){
v1B <- as.bit(v1)
v2B <- as.bit(v2)
r1 <- sum(v1B & v2B)
r2 <- sum(v1B & !v2B)
r3 <- sum(!v1B & v2B)
r4 <- length(v1) - sum(c(r1, r2, r3))
return(c(r1, r2, r3, r4))
}
benchmark(replications = 1, order = "elapsed",
#table = {res <- func_table(x,y)},
logical = {res <- func_logical(x,y)},
logicalB1 = {res <- func_logical(xb,yb)},
logicalB2 = {res <- func_logicalB(x,y)},
logical3B1 = {res <- func_logical3(xb,yb)},
logical3B2 = {res <- func_logical3B(x,y)}
#bigtabulate = {res <- func_bigtabulate(x,y)},
#bigtabulate2 = {res <- func_bigtabulate2(x1,y1)}
)
Here's an answer with Rcpp, tabulating only those entries that are not both 0. I suspect there must be several ways to improve this, as it is unusually slow; it's also my first attempt with Rcpp, so there may be some obvious inefficiencies associated with moving the data around. I wrote an example that is purposefully plain vanilla, which should let others demonstrate how it can be improved.
library(Rcpp)
library(inline)
doCrossTab <- cxxfunction(signature(x="integer", y = "integer"), body='
Rcpp::IntegerVector Vx(x);
Rcpp::IntegerVector Vy(y);
Rcpp::IntegerVector V(3);
for(int i = 0; i < Vx.length(); i++) {
if( (Vx(i) == 1) & ( Vy(i) == 1) ){ V[0]++; }
else if( (Vx(i) == 1) & ( Vy(i) == 0) ){ V[1]++; }
else if( (Vx(i) == 0) & ( Vy(i) == 1) ){ V[2]++; }
}
return( wrap(V));
', plugin="Rcpp")
Timing results for N = 3E8:
user system elapsed
10.930 1.620 12.586
This takes more than 6X as long as func_find01B in my second answer.
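Taking up the invitation above, one hypothetical tweak is to replace the if/else chain with a computed index into a length-4 vector (an untimed sketch; it also tallies the (0,0) cell, which the version above skips, and again assumes no NAs):
doCrossTabIdx <- cxxfunction(signature(x="integer", y = "integer"), body='
Rcpp::IntegerVector Vx(x);
Rcpp::IntegerVector Vy(y);
Rcpp::IntegerVector V(4);
for(int i = 0; i < Vx.length(); i++) {
    // 3 - 2*x - y maps (1,1)->0, (1,0)->1, (0,1)->2, (0,0)->3
    V[3 - 2*Vx(i) - Vy(i)]++;
}
return( wrap(V));
', plugin="Rcpp")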