I have a large data frame and I want to check whether the values of a set of (factor) variables uniquely identify each row of the data.
My current strategy is t
Perhaps anyDuplicated:
anyDuplicated( dfTemp[, c("Var1", "Var2", "Var3") ] )
or using dplyr:
dfTemp %>% select(Var1, Var2, Var3) %>% anyDuplicated()
This is still going to be wasteful though, because anyDuplicated will first paste the columns into a character vector.
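To make the return value concrete, here is a minimal sketch on a toy data frame. `dfToy` and its values are made up for illustration; only the column names follow the example above. anyDuplicated returns 0 when no row is repeated and otherwise the index of the first duplicated row, so comparing the result with 0 answers the "is this a key?" question.

```r
# Hypothetical toy data; the values are invented for illustration only.
dfToy <- data.frame(
  Var1 = c("a", "a", "b", "b"),
  Var2 = c("x", "y", "x", "x"),
  Var3 = c(1, 1, 2, 3)
)

# Index of the first duplicated row, or 0 if every row is distinct.
all_three <- anyDuplicated(dfToy[, c("Var1", "Var2", "Var3")])
first_two <- anyDuplicated(dfToy[, c("Var1", "Var2")])

# The columns uniquely identify the rows exactly when the result is 0.
is_key_all_three <- all_three == 0L
is_key_first_two <- first_two == 0L
```

Here the three columns together identify every row, while Var1 and Var2 alone do not, because rows 3 and 4 share the same pair.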
How about:
length(unique(paste(dfTemp$var1, dfTemp$var2, dfTemp$var3))) == nrow(dfTemp)
Paste the variables into one string, take the unique values, and compare the length of that vector with the number of rows in your data frame.
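One caveat with the paste approach, sketched below on made-up data: paste joins with a single space by default, so two different rows can collapse to the same string if the values themselves contain spaces. `dfCollide` is a hypothetical example, and the "\r" separator is just one choice of a character assumed not to occur in the data.

```r
# Hypothetical data where the values contain spaces.
dfCollide <- data.frame(
  var1 = c("a b", "a"),
  var2 = c("c", "b c"),
  stringsAsFactors = FALSE
)

# Default sep = " ": both rows become "a b c", so the check wrongly fails.
unique_default_sep <-
  length(unique(paste(dfCollide$var1, dfCollide$var2))) == nrow(dfCollide)

# A separator assumed absent from the data keeps the rows distinct.
unique_safe_sep <-
  length(unique(paste(dfCollide$var1, dfCollide$var2, sep = "\r"))) == nrow(dfCollide)

# anyDuplicated compares the columns themselves and is not fooled here.
unique_any_dup <- anyDuplicated(dfCollide) == 0L
```

If factor levels can contain spaces, it is probably safer to pass an explicit sep or to use anyDuplicated directly.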
The data.table package provides very fast duplicated and unique methods for data.tables. It also has a by= argument where you can provide the columns on which the duplicated/unique results should be computed.
Here's an example of a large data.frame:
require(data.table)
set.seed(45L)
## use setDT(dat) if your data is a data.frame,
## to convert it to a data.table by reference
dat <- data.table(var1=sample(100, 1e7, TRUE),
                  var2=sample(letters, 1e7, TRUE),
                  var3=sample(as.numeric(sample(c(-100:100, NA), 1e7, TRUE))))
system.time(any(duplicated(dat)))
# user system elapsed
# 1.632 0.007 1.671
This takes 25 seconds using anyDuplicated.data.frame.
# if you want to calculate based on just var1 and var2
system.time(any(duplicated(dat, by=c("var1", "var2"))))
# user system elapsed
# 0.492 0.001 0.495
This takes 7.4 seconds using anyDuplicated.data.frame.
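For a result that is easy to verify by eye, the same calls can be tried on a tiny table. This is only a sketch: `dfSmall` and its values are invented, and it assumes the data.table package is installed and that no key is set on the table. It also uses setDT, mentioned in the comments above, to convert an existing data.frame by reference.

```r
library(data.table)

# Hypothetical small data.frame; rows 3 and 4 share the same (var1, var2).
dfSmall <- data.frame(
  var1 = c(1, 1, 2, 2),
  var2 = c("a", "b", "a", "a"),
  var3 = c(10, 20, 30, 40)
)

# Convert to a data.table by reference (no copy is made).
setDT(dfSmall)

# All columns together: no row is repeated.
dup_all_cols <- any(duplicated(dfSmall))

# Restricting the check to var1 and var2 via by=: row 4 repeats row 3.
dup_two_cols <- any(duplicated(dfSmall, by = c("var1", "var2")))

# anyDuplicated also has a data.table method that accepts by=.
first_dup_row <- anyDuplicated(dfSmall, by = c("var1", "var2"))
```

So var1 and var2 do not uniquely identify the rows here, while all three columns together do.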