I want to cross-join two data tables without evaluating the full cross join, using a ranging criterion in the process. In essence, I would like CJ with filtering/ranging ex
This looks like a problem that would benefit a lot from an interval tree algorithm. A very nice implementation is available in the Bioconductor package IRanges.
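# Installation (current Bioconductor releases use BiocManager; biocLite() was the older installer)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("IRanges")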
# solution
require(IRanges)
ir1 <- IRanges(dt1$D, width=1L)
ir2 <- IRanges(dt2$D1, dt2$D2)
olaps <- findOverlaps(ir1, ir2, type="within")
cbind(dt1[queryHits(olaps)], dt2[subjectHits(olaps)])
#    id1  D id2 D1 D2
# 1:   3  6  21  5  9
# 2:   4  8  21  5  9
# 3:   4  8  22  7 12
# 4:   5 10  22  7 12
# 5:   5 10  23 10 16
# 6:   6 12  22  7 12
# 7:   6 12  23 10 16
# 8:   7 14  23 10 16
# 9:   8 16  23 10 16
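For intuition about how the full cross join can be avoided when one side consists of single points, here is a minimal base R sketch. It uses a sorted vector and binary search via findInterval rather than an interval tree, so it is not what IRanges does internally. The toy inputs (df1, df2) are an assumption reconstructed from the output above, not taken from the original post.

```r
# Assumed toy inputs, reconstructed from the output above (hypothetical)
df1 <- data.frame(id1 = 1:10, D = 2 * (1:10))
df2 <- data.frame(id2 = 21:23, D1 = c(5, 7, 10), D2 = c(9, 12, 16))

# Sort the points once so each interval can be located by binary search
df1 <- df1[order(df1$D), ]

# For each interval [D1, D2]: index of the first point >= D1
# and index of the last point <= D2
first <- findInterval(df2$D1, df1$D, left.open = TRUE) + 1L
last  <- findInterval(df2$D2, df1$D)

# Expand each non-empty index range into explicit (point, interval) row pairs
n  <- pmax(last - first + 1L, 0L)
i2 <- rep(seq_len(nrow(df2)), times = n)
i1 <- unlist(Map(function(f, k) seq.int(f, length.out = k), first, n))

res <- cbind(df1[i1, ], df2[i2, ])
res <- res[order(res$id1, res$id2), ]
rownames(res) <- NULL
print(res)
```

Only the matching pairs are ever materialised, so the work is roughly proportional to the sorting cost plus the number of hits, instead of nrow(df1) * nrow(df2).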
Overlap joins were recently implemented in data.table. This is a special case where dt1's start and end points are identical. You can grab the latest version from the GitHub project page to try this out:
require(data.table) ## 1.9.3+
dt1[, DD := D] ## duplicate column D to create intervals
setkey(dt2, D1,D2) ## key needs to be set for 2nd argument
foverlaps(dt1, dt2, by.x=c("D", "DD"), by.y=key(dt2), nomatch=0L)
#    id2 D1 D2 id1  D DD
# 1:  21  5  9   3  6  6
# 2:  21  5  9   4  8  8
# 3:  22  7 12   4  8  8
# 4:  22  7 12   5 10 10
# 5:  23 10 16   5 10 10
# 6:  22  7 12   6 12 12
# 7:  23 10 16   6 12 12
# 8:  23 10 16   7 14 14
# 9:  23 10 16   8 16 16
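When trying these approaches on small inputs, it can help to have a brute-force reference to compare against. The base R sketch below tests every (point, interval) pair with outer, so it does evaluate the full cross join and is only meant for validation, not for large tables. The inputs df1 and df2 are again hypothetical, reconstructed from the output shown above.

```r
# Assumed toy inputs, reconstructed from the output above (hypothetical)
df1 <- data.frame(id1 = 1:10, D = 2 * (1:10))
df2 <- data.frame(id2 = 21:23, D1 = c(5, 7, 10), D2 = c(9, 12, 16))

# Logical matrix: hit[i, j] is TRUE when D1[j] <= D[i] <= D2[j]
hit <- outer(df1$D, df2$D1, ">=") & outer(df1$D, df2$D2, "<=")

# Row/column positions of the TRUE cells, ordered by df1 row, then df2 row
idx <- which(hit, arr.ind = TRUE)
idx <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]

reference <- cbind(df1[idx[, "row"], ], df2[idx[, "col"], ])
rownames(reference) <- NULL
print(reference)
```

Any faster method should return the same set of (id1, id2) pairs as reference, possibly in a different row or column order.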
Here are the benchmark results on the same data shown in your post:
# Unit: seconds
#             expr         min          lq      median          uq         max neval
#            olaps  0.03600603  0.03971068  0.04341533  0.04857602  0.05373671     3
#  bioTreeRangeRes  0.11356837  0.11673968  0.11991100  0.12499391  0.13007681     3
#          dtJoin2  2.61679908  2.70327940  2.78975971  2.86864832  2.94753693     3
#           fullCJ  4.45173294  4.75271285  5.05369275  5.08333291  5.11297307     3
#          dtJoin1 16.51898878 17.39207632 18.26516387 18.60092303 18.93668220     3
#       manualIter 29.36023340 30.13354967 30.90686594 33.55910653 36.21134712     3
where dt_olaps is:
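dt_olaps <- function(dt1, dt2) {
    dt1[, DD := D]      ## adds DD to the caller's dt1 by reference (point interval [D, D])
    setkey(dt2, D1,D2)  ## also re-keys the caller's dt2 by reference
    foverlaps(dt1, dt2, by.x=c("D","DD"), by.y=key(dt2), nomatch=0L)
}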