Ranged/Filtered Cross Join with R data.table

隐瞒了意图╮ 2021-01-05 13:50

I want to cross-join two data tables without evaluating the full cross join, using a ranging criterion in the process. In essence, I would like CJ with filtering/ranging ex

2 Answers
  • 2021-01-05 14:28

    This looks like a problem that would benefit a lot from an interval tree algorithm. A very nice implementation is available in the Bioconductor package IRanges.

    # Installation
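    # biocLite() has been retired; current Bioconductor releases install via BiocManager
    if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
    BiocManager::install("IRanges")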
    
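    # Solution
    require(IRanges)
    ir1 <- IRanges(dt1$D, width=1L)   # each D becomes a width-1 range (a single point)
    ir2 <- IRanges(dt2$D1, dt2$D2)    # closed intervals [D1, D2]
    
    # find the ranges in ir1 that lie within a range of ir2
    olaps <- findOverlaps(ir1, ir2, type="within")
    # bind the matching rows of dt1 and dt2 side by side
    cbind(dt1[queryHits(olaps)], dt2[subjectHits(olaps)])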
    
       id1  D id2 D1 D2
    1:   3  6  21  5  9
    2:   4  8  21  5  9
    3:   4  8  22  7 12
    4:   5 10  22  7 12
    5:   5 10  23 10 16
    6:   6 12  22  7 12
    7:   6 12  23 10 16
    8:   7 14  23 10 16
    9:   8 16  23 10 16
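
    To make the matching rule concrete, here is a minimal brute-force sketch of the condition D1 <= D <= D2. The tables dt1 and dt2 below are assumptions reconstructed from the output above (the original data is not shown here), and the sketch tests every pair, so it is only a small-data reference, not a substitute for the interval tree approach.

    ```r
    library(data.table)

    # Assumed sample data, inferred from the printed result (not from the original post)
    dt1 <- data.table(id1 = 1:8, D = seq(2L, 16L, by = 2L))
    dt2 <- data.table(id2 = 21:23, D1 = c(5L, 7L, 10L), D2 = c(9L, 12L, 16L))

    # Test every (point, interval) pair explicitly: TRUE where D1 <= D <= D2.
    # This builds the full cross product, so it only suits small inputs.
    inside <- outer(dt1$D, dt2$D1, ">=") & outer(dt1$D, dt2$D2, "<=")

    # Row/column indices of the TRUE cells: "row" indexes dt1, "col" indexes dt2
    hits <- which(inside, arr.ind = TRUE)
    hits <- hits[order(hits[, "row"], hits[, "col"]), , drop = FALSE]

    # Bind the matching rows side by side, as in the IRanges solution
    reference <- cbind(dt1[hits[, "row"]], dt2[hits[, "col"]])
    print(reference)
    ```

    With this assumed data the reference table has 9 rows, which is handy for checking the faster approaches on a small sample.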
    
  • 2021-01-05 14:29

    Overlap joins have recently been implemented in data.table. This is a special case in which dt1's start and end points are identical. You can grab the latest version from the GitHub project page to try this out:

    require(data.table) ## 1.9.3+
    dt1[, DD := D] ## duplicate column D to create intervals
    setkey(dt2, D1,D2) ## key needs to be set for 2nd argument
    foverlaps(dt1, dt2, by.x=c("D", "DD"), by.y=key(dt2), nomatch=0L)
    
    #    id2 D1 D2 id1  D DD
    # 1:  21  5  9   3  6  6
    # 2:  21  5  9   4  8  8
    # 3:  22  7 12   4  8  8
    # 4:  22  7 12   5 10 10
    # 5:  23 10 16   5 10 10
    # 6:  22  7 12   6 12 12
    # 7:  23 10 16   6 12 12
    # 8:  23 10 16   7 14 14
    # 9:  23 10 16   8 16 16
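
    As a different technique from foverlaps, later data.table releases (1.9.8 and up) also support non-equi joins, which can express the same range condition directly in `on=`. The sketch below is an assumption-laden illustration: dt1 and dt2 are reconstructed from the output above, not taken from the original post.

    ```r
    library(data.table)  # non-equi joins need data.table 1.9.8 or later

    # Assumed sample data, inferred from the printed result (not from the original post)
    dt1 <- data.table(id1 = 1:8, D = seq(2L, 16L, by = 2L))
    dt2 <- data.table(id2 = 21:23, D1 = c(5L, 7L, 10L), D2 = c(9L, 12L, 16L))

    # For each row of dt1, keep the dt2 rows with D1 <= D and D2 >= D.
    # In j, the x. prefix picks dt2's own columns and i. picks dt1's,
    # which avoids the join columns being overwritten with dt1's D.
    res <- dt2[dt1, on = .(D1 <= D, D2 >= D), nomatch = 0L,
               .(id1, D = i.D, id2, D1 = x.D1, D2 = x.D2)]
    print(res)
    ```

    No helper column DD is needed here, and neither input is modified or keyed.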
    

    Here are the benchmark results on the same data shown in your post:

    # Unit: seconds
    #             expr         min          lq      median          uq         max neval
    #            olaps  0.03600603  0.03971068  0.04341533  0.04857602  0.05373671     3
    #  bioTreeRangeRes  0.11356837  0.11673968  0.11991100  0.12499391  0.13007681     3
    #          dtJoin2  2.61679908  2.70327940  2.78975971  2.86864832  2.94753693     3
    #           fullCJ  4.45173294  4.75271285  5.05369275  5.08333291  5.11297307     3
    #          dtJoin1 16.51898878 17.39207632 18.26516387 18.60092303 18.93668220     3
    #       manualIter 29.36023340 30.13354967 30.90686594 33.55910653 36.21134712     3
    

    where dt_olaps (presumably the `olaps` entry in the benchmark) is:

    dt_olaps <- function(dt1, dt2) {
        dt1[, DD := D]
        setkey(dt2, D1,D2)
        foverlaps(dt1, dt2, by.x=c("D","DD"), by.y=key(dt2), nomatch=0L)
    }
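
    Note that dt_olaps adds DD to dt1 and sets a key on dt2 by reference, so the caller's tables are changed. If that is unwanted, a variant working on copies might look like the sketch below; dt_olaps_copy is a hypothetical name, and the sample tables are assumptions reconstructed from the output shown earlier.

    ```r
    library(data.table)

    # Hypothetical variant of dt_olaps that leaves both inputs untouched
    dt_olaps_copy <- function(dt1, dt2) {
        x <- copy(dt1)[, DD := D]   # zero-width interval [D, D], added to a copy
        y <- copy(dt2)
        setkey(y, D1, D2)           # foverlaps needs its second argument keyed
        res <- foverlaps(x, y, by.x = c("D", "DD"), by.y = key(y), nomatch = 0L)
        res[, DD := NULL][]         # drop the helper column before returning
    }

    # Assumed sample data, inferred from the printed result (not from the original post)
    dt1 <- data.table(id1 = 1:8, D = seq(2L, 16L, by = 2L))
    dt2 <- data.table(id2 = 21:23, D1 = c(5L, 7L, 10L), D2 = c(9L, 12L, 16L))

    res <- dt_olaps_copy(dt1, dt2)
    print(res)
    ```

    The copies cost extra memory, so for large tables the by-reference version in the answer may still be preferable.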
    