if I understand correctly, duplicated()
function for data.table
returns a logical vector which doesn\'t contain first occurrence of duplicated reco
This appears to work:
> myDT[unique(myDT),fD:=.N>1]
> myDT
id fB fC fD
1: 1 b1 c1 TRUE
2: 3 b1 c1 TRUE
3: 5 b1 c1 TRUE
4: 2 b2 c2 FALSE
5: 4 b3 c3 FALSE
Thanks to @flodel, the better way to do it is this:
> myDT[, fD := .N > 1, by = key(myDT)]
> myDT
id fB fC fD
1: 1 b1 c1 TRUE
2: 3 b1 c1 TRUE
3: 5 b1 c1 TRUE
4: 2 b2 c2 FALSE
5: 4 b3 c3 FALSE
The difference in efficiency is substantial:
> microbenchmark(
key=myDT[, fD := .N > 1, by = key(myDT)],
unique=myDT[unique(myDT),fD:=.N>1])
Unit: microseconds
expr min lq median uq max neval
key 679.874 715.700 735.0575 773.7595 1825.437 100
unique 1417.845 1485.913 1522.7475 1567.9065 24053.645 100
Especially for the max. What's going on there?
A third approach, (that appears more efficient for this small example)
You can explicitly call duplicated.data.frame
....
myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
.SDcols = key(myDT)]
microbenchmark(
key=myDT[, fD := .N > 1, by = key(myDT)],
unique=myDT[unique(myDT),fD:=.N>1],
dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
.SDcols = key(myDT)])
## Unit: microseconds
## expr min lq median uq max neval
## key 556.608 575.9265 588.906 600.9795 27713.242 100
## unique 1112.913 1164.8310 1183.244 1216.9000 2263.557 100
## dup 420.173 436.3220 448.396 461.3750 699.986 100
If we expand the size of the sample data.table
, then the key
approach is the clear winner
myDT <- data.table(id = sample(1e6),
fB = sample(seq_len(1e3), size= 1e6, replace=TRUE),
fC = sample(seq_len(1e3), size= 1e6,replace=TRUE ))
setkeyv(myDT, c('fB', 'fC'))
microbenchmark(
key=myDT[, fD := .N > 1, by = key(myDT)],
unique=myDT[unique(myDT),fD:=.N>1],
dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
.SDcols = key(myDT)],times=10)
## Unit: milliseconds
## expr min lq median uq max neval
## key 355.9258 358.1764 360.7628 450.9218 500.8360 10
## unique 451.3794 458.0258 483.3655 519.3341 553.2515 10
## dup 1690.1579 1721.5784 1775.5948 1826.0298 1845.4012 10
Many years ago this was the fastest answer by a large margin (see revision history if interested):
dups = duplicated(myDT, by = key(myDT));
myDT[, fD := dups | c(tail(dups, -1), FALSE)]
There have been a lot of internal changes since then however, that have made many options about the same order:
myDT <- data.table(id = sample(1e6),
fB = sample(seq_len(1e3), size= 1e6, replace=TRUE),
fC = sample(seq_len(1e3), size= 1e6,replace=TRUE ))
setkey(myDT, fB, fC)
microbenchmark(
key=myDT[, fD := .N > 1, by = key(myDT)],
unique=myDT[unique(myDT, by = key(myDT)),fD:=.N>1],
dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
.SDcols = key(myDT)],
dup2 = {dups = duplicated(myDT, by = key(myDT)); myDT[, fD := dups | c(tail(dups, -1L), FALSE)]},
dup3 = {dups = duplicated(myDT, by = key(myDT)); myDT[, fD := dups | c(dups[-1L], FALSE)]},
times=10)
# expr min lq mean median uq max neval
# key 523.3568 567.5372 632.2379 578.1474 678.4399 886.8199 10
# unique 189.7692 196.0417 215.4985 210.5258 224.4306 290.2597 10
# dup 4440.8395 4685.1862 4786.6176 4752.8271 4900.4952 5148.3648 10
# dup2 143.2756 153.3738 236.4034 161.2133 318.1504 419.4082 10
# dup3 144.1497 150.9244 193.3058 166.9541 178.0061 460.5448 10
As of data.table version 1.9.8, the solution by eddi needs to be modified to be:
dups = duplicated(myDT, by = key(myDT));
myDT[, fD := dups | c(tail(dups, -1), FALSE)]
since:
Changes in v1.9.8 (on CRAN 25 Nov 2016)
POTENTIALLY BREAKING CHANGES
By default all columns are now used by unique(), duplicated() and uniqueN() data.table methods, #1284 and #1841. To restore old behaviour: options(datatable.old.unique.by.key=TRUE). In 1 year this option to restore the old default will be deprecated with warning. In 2 years the option will be removed. Please explicitly pass by=key(DT) for clarity. Only code that relies on the default is affected. 266 CRAN and Bioconductor packages using data.table were checked before release. 9 needed to change and were notified. Any lines of code without test coverage will have been missed by these checks. Any packages not on CRAN or Bioconductor were not checked.