find *all* duplicated records in data.table (not all-but-one)

front-end · open · 4 answers · 495 views
Happy的楠姐 2020-12-15 03:27

If I understand correctly, the duplicated() function for data.table returns a logical vector which doesn't include the first occurrence of each duplicated record. Is there a way to flag all duplicated records, including their first occurrences?
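For concreteness, here is an assumed reconstruction of the sample data used in the answers below (the question itself does not show it); rows with ids 1, 3 and 5 share the same (fB, fC) key pair:

    library(data.table)
    myDT <- data.table(id = c(1, 2, 3, 4, 5),
                       fB = c("b1", "b2", "b1", "b3", "b1"),
                       fC = c("c1", "c2", "c1", "c3", "c1"))
    setkey(myDT, fB, fC)          # sorts rows by (fB, fC): ids 1, 3, 5, 2, 4
    duplicated(myDT, by = key(myDT))
    # FALSE TRUE TRUE FALSE FALSE -- first occurrences are not marked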

4 Answers
  • 2020-12-15 03:54

    This appears to work:

    > myDT[unique(myDT),fD:=.N>1]
    > myDT
       id  fB fC    fD
    1:  1  b1 c1  TRUE
    2:  3  b1 c1  TRUE
    3:  5  b1 c1  TRUE
    4:  2  b2 c2 FALSE
    5:  4  b3 c3 FALSE
    

    Thanks to @flodel, the better way to do it is this:

    > myDT[, fD := .N > 1, by = key(myDT)]
    > myDT
       id  fB fC    fD
    1:  1  b1 c1  TRUE
    2:  3  b1 c1  TRUE
    3:  5  b1 c1  TRUE
    4:  2  b2 c2 FALSE
    5:  4  b3 c3 FALSE
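    As a minimal sketch of why this works: within j, .N is the number of rows in the current by-group, so .N > 1 is TRUE for every member of any key group that occurs more than once (the data here is illustrative):

        library(data.table)
        dt <- data.table(k = c("a", "a", "b"))
        dt[, flag := .N > 1, by = k]   # .N is the per-group row count
        dt$flag
        # TRUE TRUE FALSE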
    

    The difference in efficiency is substantial:

    > microbenchmark(
        key=myDT[, fD := .N > 1, by = key(myDT)],
        unique=myDT[unique(myDT),fD:=.N>1])
    Unit: microseconds
       expr      min       lq    median        uq       max neval
        key  679.874  715.700  735.0575  773.7595  1825.437   100
     unique 1417.845 1485.913 1522.7475 1567.9065 24053.645   100
    

    Especially for the max. What's going on there?

  • 2020-12-15 03:55

    A third approach (which appears more efficient for this small example):

    You can explicitly call duplicated.data.frame on the key columns:

    myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
      .SDcols = key(myDT)]
    
    
     microbenchmark(
       key=myDT[, fD := .N > 1, by = key(myDT)],
       unique=myDT[unique(myDT),fD:=.N>1], 
      dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE), 
        .SDcols = key(myDT)])
    ## Unit: microseconds
    ##    expr      min        lq   median        uq       max neval
    ##     key  556.608  575.9265  588.906  600.9795 27713.242   100
    ##  unique 1112.913 1164.8310 1183.244 1216.9000  2263.557   100
    ##     dup  420.173  436.3220  448.396  461.3750   699.986   100
    

    If we expand the size of the sample data.table, then the key approach is the clear winner

     myDT <- data.table(id = sample(1e6), 
      fB = sample(seq_len(1e3), size= 1e6, replace=TRUE), 
      fC = sample(seq_len(1e3), size= 1e6,replace=TRUE ))
    setkeyv(myDT, c('fB', 'fC'))
    
    microbenchmark(
      key=myDT[, fD := .N > 1, by = key(myDT)],
      unique=myDT[unique(myDT),fD:=.N>1], 
      dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
       .SDcols = key(myDT)],times=10)
    ## Unit: milliseconds
    ##    expr       min        lq    median        uq       max neval
    ##     key  355.9258  358.1764  360.7628  450.9218  500.8360    10
    ##  unique  451.3794  458.0258  483.3655  519.3341  553.2515    10
    ##     dup 1690.1579 1721.5784 1775.5948 1826.0298 1845.4012    10
    
  • 2020-12-15 03:58

    Many years ago this was the fastest answer by a large margin (see revision history if interested):

    dups = duplicated(myDT, by = key(myDT));
    myDT[, fD := dups | c(tail(dups, -1), FALSE)]
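    The shift trick works because the table is keyed, so duplicate rows are adjacent: duplicated() marks all but the first row of each run, and OR-ing with the vector shifted left by one position also marks the row just before each run of TRUEs, i.e. the first occurrence. A base-R sketch with made-up values:

        dups <- c(FALSE, TRUE, TRUE, FALSE, FALSE)  # three identical keyed rows, then two unique
        dups | c(tail(dups, -1), FALSE)
        # TRUE TRUE TRUE FALSE FALSE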
    

    There have been a lot of internal changes since then, however, which have brought many of the options to about the same order of magnitude:

    myDT <- data.table(id = sample(1e6), 
                       fB = sample(seq_len(1e3), size= 1e6, replace=TRUE), 
                       fC = sample(seq_len(1e3), size= 1e6,replace=TRUE ))
    setkey(myDT, fB, fC)
    
    microbenchmark(
       key=myDT[, fD := .N > 1, by = key(myDT)],
       unique=myDT[unique(myDT, by = key(myDT)),fD:=.N>1], 
       dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
                  .SDcols = key(myDT)],
       dup2 = {dups = duplicated(myDT, by = key(myDT)); myDT[, fD := dups | c(tail(dups, -1L), FALSE)]},
       dup3 = {dups = duplicated(myDT, by = key(myDT)); myDT[, fD := dups | c(dups[-1L], FALSE)]},
       times=10)
    
    #   expr       min        lq      mean    median        uq       max neval
    #    key  523.3568  567.5372  632.2379  578.1474  678.4399  886.8199    10
    # unique  189.7692  196.0417  215.4985  210.5258  224.4306  290.2597    10
    #    dup 4440.8395 4685.1862 4786.6176 4752.8271 4900.4952 5148.3648    10
    #   dup2  143.2756  153.3738  236.4034  161.2133  318.1504  419.4082    10
    #   dup3  144.1497  150.9244  193.3058  166.9541  178.0061  460.5448    10
    
  • 2020-12-15 04:02

    As of data.table version 1.9.8, the solution by eddi needs to be modified to be:

    dups = duplicated(myDT, by = key(myDT));
    myDT[, fD := dups | c(tail(dups, -1), FALSE)]
    

    since:

    Changes in v1.9.8 (on CRAN 25 Nov 2016)

    POTENTIALLY BREAKING CHANGES

    By default all columns are now used by unique(), duplicated() and uniqueN() data.table methods, #1284 and #1841. To restore old behaviour: options(datatable.old.unique.by.key=TRUE). In 1 year this option to restore the old default will be deprecated with warning. In 2 years the option will be removed. Please explicitly pass by=key(DT) for clarity. Only code that relies on the default is affected. 266 CRAN and Bioconductor packages using data.table were checked before release. 9 needed to change and were notified. Any lines of code without test coverage will have been missed by these checks. Any packages not on CRAN or Bioconductor were not checked.
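    The behavioural change can be seen directly (a minimal sketch; the table and key are illustrative):

        library(data.table)
        DT <- data.table(a = c(1, 1, 2), b = c(10, 20, 30))
        setkey(DT, a)
        duplicated(DT)                # >= 1.9.8: compares all columns -> FALSE FALSE FALSE
        duplicated(DT, by = key(DT))  # compares the key only          -> FALSE TRUE FALSE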
