find all duplicated records in data.table (not all-but-one)

Happy的楠姐 2020-12-15 03:27

If I understand correctly, the duplicated() function for data.table returns a logical vector which doesn't flag the first occurrence of a duplicated reco
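To make the behaviour described in the question concrete, here is a minimal sketch in base R. The toy data frame `df` and the variable names are hypothetical and not taken from the post. The `fromLast = TRUE` idiom is a general base R technique, named here as an alternative and not as the approach used in the answers.

```r
# Hypothetical toy data, not from the original post.
df <- data.frame(fB = c("b1", "b2", "b1", "b3", "b1"),
                 fC = c("c1", "c2", "c1", "c3", "c1"))

# duplicated() flags every repeat except the first occurrence in each group.
dup_after_first <- duplicated(df)

# Scanning from the end as well and OR-ing the two vectors
# flags every member of a duplicated group, including the first.
dup_all <- duplicated(df) | duplicated(df, fromLast = TRUE)

print(dup_after_first)
print(dup_all)
```

The second vector is the "all duplicates, not all-but-one" result the title asks for.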

4 answers

囚心锁ツ (original poster)

2020-12-15 03:54

This appears to work:

> myDT[unique(myDT),fD:=.N>1]
> myDT
   id  fB fC    fD
1:  1  b1 c1  TRUE
2:  3  b1 c1  TRUE
3:  5  b1 c1  TRUE
4:  2  b2 c2 FALSE
5:  4  b3 c3 FALSE

Thanks to @flodel, the better way to do it is this:

> myDT[, fD := .N > 1, by = key(myDT)]
> myDT
   id  fB fC    fD
1:  1  b1 c1  TRUE
2:  3  b1 c1  TRUE
3:  5  b1 c1  TRUE
4:  2  b2 c2 FALSE
5:  4  b3 c3 FALSE
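For anyone who wants to reproduce the grouped approach, here is a self-contained sketch. The construction of `myDT` is an assumption reconstructed from the printed output (the post does not show how the table was built or which columns form the key), and `allDups` is a hypothetical name.

```r
library(data.table)

# Assumed reconstruction of the example table; the key columns fB and fC
# are a guess based on the row order in the printed output.
myDT <- data.table(id = 1:5,
                   fB = c("b1", "b2", "b1", "b3", "b1"),
                   fC = c("c1", "c2", "c1", "c3", "c1"))
setkey(myDT, fB, fC)  # sorts the rows by fB, then fC

# Within each key group, .N is the group size; a size above one
# means every row in that group is a duplicate on the key columns.
myDT[, fD := .N > 1, by = key(myDT)]

# Keep every row that belongs to a duplicated group, first occurrence included.
allDups <- myDT[fD == TRUE]
print(allDups)
```

Grouping on the key and testing `.N > 1` avoids building the intermediate `unique(myDT)` table and joining back to it.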

The difference in efficiency is substantial; by median, the keyed version is roughly twice as fast:

> microbenchmark(
    key=myDT[, fD := .N > 1, by = key(myDT)],
    unique=myDT[unique(myDT),fD:=.N>1])
Unit: microseconds
   expr      min       lq    median        uq       max neval
    key  679.874  715.700  735.0575  773.7595  1825.437   100
 unique 1417.845 1485.913 1522.7475 1567.9065 24053.645   100

Especially for the max. What's going on there?
