How do I do a negative / nomatch / inverse search in data.table?

囚心锁ツ 2020-12-29 06:00

What happens if I want to select all the rows in a data.table that do not contain a particular value in the key variable using binary search? By the way, what is the correct

2 answers
  • 2020-12-29 06:29

    Andrie's answer is great, and is what I'd probably use. Interestingly, though, the following construct seems to be (just a bit) faster, especially as the size of the data.table increases.

    DT[J(x = unique(DT)[x!="a"][,x])]
    
    ##-------------------------------- Timings -----------------------------------##
    
    library(data.table)
    library(rbenchmark)
    
    DT = data.table(x=rep(c("a","b","c"),each=45e5), y=c(1,3,6), v=1:9, key="x")
    Josh <- function() DT[J(x = unique(DT)[x!="a"][,x])]
    Andrie <- function() DT[-DT["a", which=TRUE]]
    
    ## Compare results
    identical(Josh(), setkey(Andrie(), "x"))  
    # [1] TRUE
    
    ## Compare timings
    benchmark(replications = 10, order="relative", Josh=Josh(), Andrie=Andrie())
        test replications elapsed relative user.self sys.self user.child sys.child
    1   Josh           10   17.50    1.000     14.78      3.6         NA        NA
    2 Andrie           10   18.75    1.071     16.52      3.2         NA        NA
    

    I'd be especially tempted to use this if DT[,x] could be made to return a data.table rather than a vector. Then, the construct could be simplified a bit to DT[unique(DT[,x])[x!="a"]]. Also, it would then work even when there are multiple columns in the key, which it currently does not.
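
For what it's worth, the simplification wished for above can be sketched with `.(x)` in `j`, which returns a one-column data.table rather than a vector. This is only an illustration on a small stand-in table, not the benchmark data. `by = "x"` is spelled out because, as far as I know, `unique()` on a data.table has used all columns by default since v1.9.8 rather than the key. The names `keep` and `res` are made up for the example.

```r
library(data.table)

# Small stand-in for the benchmark table (same columns, 9 rows)
DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = c(1, 3, 6), v = 1:9, key = "x")

# One row per distinct key value, minus the value to exclude.
# .(x) keeps the result a data.table instead of a plain vector.
keep <- unique(DT, by = "x")[x != "a", .(x)]

# Join the full table against the key values we want to keep
res <- DT[keep, on = "x"]
print(res)
```

The same pattern should extend to a multi-column key by listing the key columns in `by`, `.()` and `on`, though that case is not shown here.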

  • 2020-12-29 06:43

    The idiom is this:

    DT[-DT["a", which=TRUE]]
    
       x y v
    1: b 1 4
    2: b 3 5
    3: b 6 6
    4: c 1 7
    5: c 3 8
    6: c 6 9
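
A self-contained way to reproduce this, assuming a 9-row table keyed on `x` (the construction below is a guess chosen to match the printed result, and `hit` and `res` are illustrative names), splits the idiom into its two steps:

```r
library(data.table)

# Assumed table: 9 rows keyed on x, chosen to match the output shown above
DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = c(1, 3, 6), v = 1:9, key = "x")

# Step 1: binary search on the key; which = TRUE returns the matching
# row numbers instead of the rows themselves
hit <- DT["a", which = TRUE]
print(hit)

# Step 2: negative indexing drops exactly those rows
res <- DT[-hit]
print(res)
```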
    

    Inspiration from:

    • The mailing list posting Return Select/Join that does NOT match?
    • The previous question non-joins with data.tables
    • Matthew Dowle's answer to Porting set operations from R's data frames to data tables: How to identify duplicated rows?

    Update: new in v1.8.3 is not-join syntax. Farrel's first expectation (! rather than -) has been implemented:

    DT[-DT["a",which=TRUE,nomatch=0],...]   # old idiom
    DT[!"a",...]                            # same result, now preferred.
    

    See the NEWS item for more detailed info and example.
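
As a quick check that the two forms agree, here is a small sketch on an assumed 9-row table keyed on `x` (the table and the names `old`, `new` and `adhoc` are illustrative only). The `on =` variant, available in later data.table versions, is included for the case where no key is set.

```r
library(data.table)

# Assumed example table, keyed on x
DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = c(1, 3, 6), v = 1:9, key = "x")

old   <- DT[-DT["a", which = TRUE, nomatch = 0]]  # old idiom
new   <- DT[!"a"]                                 # not-join on the key
adhoc <- DT[!"a", on = "x"]                       # not-join naming the column explicitly

print(new)

# All three should keep the same rows
print(identical(old$v, new$v))
print(identical(adhoc$v, new$v))
```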
