Fast EXISTS in data.table

伪装坚强ぢ 2021-02-10 05:16

What is the fastest way to check if a value exists in a data.table? Suppose that

  • dt is a data.table of n columns with k columns being the key
  • keys is a l
2 answers
  • 2021-02-10 05:39

    Short answer: in addition to nomatch=0, I think mult="first" would speed it up even more.

    Long answer: assuming you want to check whether one or more values are present in the key column of a data.table, the following seems to be much faster. The only assumption here is that the data.table has a single key column (the question is ambiguous on this point).

    my.values = c(1:100, 1000)
    require(data.table)
    set.seed(45)
    DT <- as.data.table(matrix(sample(2e4, 1e6*100, replace=TRUE), ncol=100))
    setkey(DT, "V1")
    # the data.table way
    system.time(all(my.values %in% .subset2(DT[J(my.values), mult="first", nomatch=0], "V1")))
       user  system elapsed 
      0.006   0.000   0.006 
    
    # vector (scan) approach
    system.time(all(my.values %in% .subset2(DT, "V1")))
       user  system elapsed 
      0.037   0.000   0.038 
    

    You can change all to any if you want to check whether at least one value is present. The only difference between the two approaches is that the first one subsets with data.table's keyed join (taking advantage of the key and the mult argument) before calling %in%. As the timings show, it is considerably faster (roughly six times in this run) and also scales well. To retrieve the key column from the subset (call it the_subset), use any of:

    .subset2(the_subset, "V1")   # or
    the_subset$V1                # or
    the_subset[["V1"]]
    

    However, the_subset[, V1] will be slower, because it goes through the overhead of [.data.table.

    Of course the same idea could be extended to many columns as well, but I'll have to know exactly what you want to do after.
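
    As a rough sketch of how the idea might carry over to a key with more than one column: the table dt, its columns a, b and v, and the lookup table keys below are all made up for illustration, and which=TRUE is used (an assumption about what is wanted) so that every lookup row gets its own TRUE/FALSE rather than being dropped by nomatch=0.

    ```r
    library(data.table)

    # Hypothetical data: a table keyed on two columns, a and b.
    dt <- data.table(a = c(1L, 1L, 2L, 3L),
                     b = c("x", "y", "x", "z"),
                     v = 1:4)
    setkey(dt, a, b)

    # Hypothetical key combinations to look up; (2, "y") is not in dt.
    keys <- data.table(a = c(1L, 2L, 2L),
                       b = c("y", "x", "y"))

    # Keyed join on (a, b). which = TRUE returns the matching row number in dt
    # for each row of keys (NA when there is no match); mult = "first" stops at
    # the first match per lookup row.
    idx <- dt[keys, which = TRUE, mult = "first"]

    # One logical per lookup row: does that key combination exist in dt?
    exists_flag <- !is.na(idx)
    ```

    Wrapping exists_flag in all() or any() gives the same kind of single yes/no answer as in the one-column case.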

  • 2021-02-10 05:43

    How about the base R idiom:

    any(my.value %in% my.vector)
    

    This is not a data.table-specific idiom, but I believe it is quite efficient.
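
    For illustration, a tiny example with made-up values (my.vector and my.value here are placeholders, not data from the question), showing how any and all differ when several values are checked at once:

    ```r
    # Hypothetical inputs for illustration only.
    my.vector <- c(3L, 7L, 11L, 42L)
    my.value  <- c(7L, 100L)

    # Element-wise membership: one logical per element of my.value.
    found <- my.value %in% my.vector

    any_found <- any(found)  # TRUE if at least one value is present
    all_found <- all(found)  # TRUE only if every value is present
    ```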
