find *all* duplicated records in data.table (not all-but-one)

Happy的楠姐 2020-12-15 03:27

If I understand correctly, the duplicated() function for data.table returns a logical vector which doesn't contain the first occurrence of duplicated reco

4 answers
  •  囚心锁ツ
    2020-12-15 03:54

    This appears to work:

    > myDT[unique(myDT),fD:=.N>1]
    > myDT
       id  fB fC    fD
    1:  1  b1 c1  TRUE
    2:  3  b1 c1  TRUE
    3:  5  b1 c1  TRUE
    4:  2  b2 c2 FALSE
    5:  4  b3 c3 FALSE
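
    The console line above relies on myDT being keyed and on join and unique() defaults that, as far as I know, differ in recent data.table releases. A minimal sketch of the same idea for a current release follows; the sample data is hypothetical (chosen only to reproduce the printed table), and the explicit `on =`, `by = .EACHI`, and `by = key(myDT)` arguments are my assumptions about what the original call did implicitly.

```r
library(data.table)

# Hypothetical sample data, picked to reproduce the table printed above
myDT <- data.table(
  id = c(1L, 3L, 5L, 2L, 4L),
  fB = c("b1", "b1", "b1", "b2", "b3"),
  fC = c("c1", "c1", "c1", "c2", "c3")
)
setkey(myDT, fB, fC)   # assumed key: the columns that define a duplicate

# One row per distinct key combination
keys_once <- unique(myDT, by = key(myDT))

# Join back onto myDT; by = .EACHI evaluates .N once per row of keys_once,
# so .N is the number of rows in myDT sharing that key combination
myDT[keys_once, fD := .N > 1, on = key(myDT), by = .EACHI]

print(myDT)
```

    Every row of a duplicated group is flagged here, including its first occurrence, which is the behaviour the question asks for.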
    

    Thanks to @flodel, the better way to do it is this:

    > myDT[, fD := .N > 1, by = key(myDT)]
    > myDT
       id  fB fC    fD
    1:  1  b1 c1  TRUE
    2:  3  b1 c1  TRUE
    3:  5  b1 c1  TRUE
    4:  2  b2 c2 FALSE
    5:  4  b3 c3 FALSE
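
    For anyone who wants to run the grouped version end to end, here is a self-contained sketch. The sample data and the choice of key columns (fB, fC) are assumptions made to match the printed output. It also cross-checks the result against a second formulation that combines duplicated() with its fromLast argument, which is not part of the answer above but should give the same flags.

```r
library(data.table)

# Hypothetical sample data, picked to reproduce the table printed above
myDT <- data.table(
  id = c(1L, 3L, 5L, 2L, 4L),
  fB = c("b1", "b1", "b1", "b2", "b3"),
  fC = c("c1", "c1", "c1", "c2", "c3")
)
setkey(myDT, fB, fC)   # assumed key: the columns that define a duplicate

# Group by the key columns; .N is the group size, so .N > 1 is TRUE
# for every member of a duplicated group, first occurrence included
myDT[, fD := .N > 1, by = key(myDT)]

# Cross-check: a row belongs to a duplicated group if it repeats an
# earlier row or is repeated by a later one
dup_all <- duplicated(myDT, by = key(myDT)) |
  duplicated(myDT, by = key(myDT), fromLast = TRUE)

stopifnot(identical(myDT$fD, dup_all))
print(myDT)
```

    The grouped form needs only one pass over the key groups, which may be part of why it benchmarks faster than the join on unique(myDT).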
    

    The difference in efficiency is substantial:

    > microbenchmark(
        key=myDT[, fD := .N > 1, by = key(myDT)],
        unique=myDT[unique(myDT),fD:=.N>1])
    Unit: microseconds
       expr      min       lq    median        uq       max neval
        key  679.874  715.700  735.0575  773.7595  1825.437   100
     unique 1417.845 1485.913 1522.7475 1567.9065 24053.645   100
    

    The gap is widest in the max column. What's going on there?
