For example, I have a table as follows:
DT <- data.table(
  A = c(1,1,1,2,2,2,3,3,3),
  B = c(1,2,3,1,2,3,1,2,3),
  key = "A"
)
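I want to extract the unique rows, treating each row as an unordered set, so that (1, 2) and (2, 1) count as the same row.
Another method for the case where there are only 2 columns, using an anti-join: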
dupes <- unique(DT[B > A])[unique(DT[A < B]), on=c("A"="B", "B"="A")]
ans <- unique(DT)[!dupes, on=.(A, B)]
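To see what the anti-join is doing, here is a minimal sketch on the small table from the question. It is a simplified variant, not the exact code above: as far as I can tell, the join there ends up producing the mirrored (B, A) version of every distinct row with A < B, so this sketch builds that mirrored table directly. The name `swapped` is made up for illustration.

```r
library(data.table)

DT <- data.table(
  A = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  B = c(1, 2, 3, 1, 2, 3, 1, 2, 3)
)

# Mirror image of every distinct row with A < B
# (`swapped` is a hypothetical helper name, not from the answer above)
swapped <- unique(DT[A < B])[, .(A = B, B = A)]

# Anti-join: keep the distinct rows that are not one of those mirror images
ans <- unique(DT)[!swapped, on = .(A, B)]
print(ans)
```

Rows with A == B are never mirrored, so they always survive the anti-join.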
Timing code:
library(data.table)
set.seed(0L)
nr <- 1e5
nElem <- 1e3
mat <- matrix(sample(nElem, nr*2, replace=TRUE), ncol=2)
DT <- as.data.table(mat)
setnames(DT, c("A", "B"))
DT2 <- copy(DT)
library(microbenchmark)
mtd1 <- function() unique(data.frame(A=pmin(mat[, 1], mat[, 2]), B=pmax(mat[, 1], mat[, 2])))
mtd2 <- function() DT[!duplicated(apply(cbind(A, B), 1L, sort), MARGIN = 2L)]
mtd3 <- function() DT2[, g := paste(B, A, sep="_")][A < B, g := paste(A, B, sep="_")][!duplicated(g), !"g"]
mtd4 <- function() {
  dupes <- unique(DT[B > A])[unique(DT[A < B]), on=c("A"="B", "B"="A")]
  ans <- unique(DT)[!dupes, on=.(A, B)]
}
microbenchmark(mtd1(),mtd2(),mtd3(),mtd4(),times=3L)
Some timings:
Unit: milliseconds
expr min lq mean median uq max neval
mtd1() 118.62051 129.50581 153.77216 140.39111 171.34799 202.30487 3
mtd2() 3500.47877 3552.80879 3732.67006 3605.13882 3848.76571 4092.39260 3
mtd3() 89.22901 92.94830 97.22658 96.66759 101.22536 105.78313 3
mtd4() 28.61628 32.37641 50.90126 36.13654 62.04375 87.95096 3
But the fastest is eddi's method, from the question "data.table with two string columns of set elements, extract unique rows with each row unsorted":
mtd5 <- function() DT[DT[, .I[1L], by=.(pmin(A, B), pmax(A, B))]$V1]
microbenchmark(mtd1(),mtd2(),mtd3(),mtd4(),mtd5(),times=3L)
Timings:
Unit: milliseconds
expr min lq mean median uq max neval
mtd1() 149.62224 150.70685 175.66394 151.79146 188.68479 225.57813 3
mtd2() 4126.51014 4140.72876 4277.37907 4154.94738 4352.81353 4550.67968 3
mtd3() 126.01679 131.26463 134.63642 136.51247 138.94624 141.38000 3
mtd4() 39.24141 42.42815 45.65804 45.61489 48.86635 52.11781 3
mtd5() 12.58396 16.68156 18.21613 20.77915 21.03221 21.28527 3
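The grouping trick in mtd5 is easier to follow on the small table from the question. This is just a sketch that splits the one-liner into two steps; `first_rows` and `res` are names introduced here for illustration.

```r
library(data.table)

DT <- data.table(
  A = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  B = c(1, 2, 3, 1, 2, 3, 1, 2, 3)
)

# Group rows by an order-independent key (smaller value, larger value)
# and record the row number of the first row in each group.
first_rows <- DT[, .I[1L], by = .(pmin(A, B), pmax(A, B))]$V1

# Subset the original table by those row numbers, so each surviving row
# keeps its values exactly as they appeared in DT.
res <- DT[first_rows]
print(res)
```

Since only row numbers are computed inside the grouping, no strings are pasted and no per-row sorting is done, which is presumably why it benchmarks so well.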
Maybe not the most efficient, but this leverages the duplicated.matrix method:
DT[!duplicated(apply(cbind(A, B), 1L, sort), MARGIN = 2L)]
# A B
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 2 2
# 5: 2 3
# 6: 3 3
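The reason for MARGIN = 2L is that apply() over rows returns its results as columns. A small base R sketch of that step, using a made-up three-row matrix `m`:

```r
# Hypothetical toy input: rows (1,2), (2,1), (2,3)
m <- cbind(A = c(1, 2, 2), B = c(2, 1, 3))

# Sorting each row gives a 2 x 3 matrix: one COLUMN per original row
sorted <- apply(m, 1L, sort)
print(sorted)

# So duplicates have to be looked for across columns, hence MARGIN = 2L
keep <- !duplicated(sorted, MARGIN = 2L)
print(m[as.vector(keep), , drop = FALSE])
```

The second row, (2, 1), sorts to the same column as the first row, so it is flagged as a duplicate.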
If you only have two columns, then you could do:
unique(do.call(function(A,B)data.table(A=pmin(A,B),B=pmax(A,B)),DT))
A B
1: 1 1
2: 1 2
3: 1 3
4: 2 2
5: 2 3
6: 3 3
Another option:
DT[, g := paste(B, A, sep="_")][A < B, g := paste(A, B, sep="_")][!duplicated(g), !"g"]
A B
1: 1 1
2: 1 2
3: 1 3
4: 2 2
5: 2 3
6: 3 3
So ...
The last step could alternately be unique(DT, by="g").
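A minimal sketch of that variant on the table from the question, keeping the same pasted key `g` and only swapping the last step for unique(). The name `res` and the final cleanup line are additions for illustration.

```r
library(data.table)

DT <- data.table(
  A = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  B = c(1, 2, 3, 1, 2, 3, 1, 2, 3)
)

# Build the order-independent key: "smaller_larger" for every row
DT[, g := paste(B, A, sep = "_")][A < B, g := paste(A, B, sep = "_")]

# Keep the first row for each key, then drop the helper column from the result
res <- unique(DT, by = "g")[, !"g"]
print(res)

# g was added to DT by reference, so remove it again if it is not wanted
DT[, g := NULL]
```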