Following up from my question here, I am trying to replicate in R the functionality of the Stata command duplicates tag, which allows me to tag all the rows of
I don't really have an answer to your three questions, but I can save you some time. I also split time between Stata and R and often miss Stata's duplicates commands. But if you subset the duplicated rows and then merge them back with all=TRUE, you can save a lot of time.
Here's an example.
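# my more Stata-ish approach
system.time({
  # key columns (1:4) of each repeated row; if a key can occur three or more
  # times, wrap this in unique() so the merge does not multiply rows
  dupes <- dfDup[duplicated(dfDup[, 1:4]), 1:4]
  dupes$dup <- 1
  # merge on the shared key columns, keeping unmatched rows (dup is NA there)
  dfTemp2 <- merge(dfDup, dupes, all=TRUE)
  # rows with no match are not duplicates
  dfTemp2$dup <- ifelse(is.na(dfTemp2$dup), 0, dfTemp2$dup)
})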
This is quite a bit faster: roughly 0.6 seconds elapsed versus about 120 seconds for the ddply approach.
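One caveat with the subset-then-merge approach: duplicated() flags every occurrence after the first, so a key that appears three or more times ends up in dupes more than once, and the merge then multiplies those rows. Here is a minimal sketch of a variant that guards against this with unique(). The toy data frame df, its columns f1, f2 and data, and the keys vector are made up for illustration and are not from the question.

```r
# Toy data (hypothetical): key "a" occurs twice, "b" once, "c" three times.
df <- data.frame(
  f1   = c("a", "a", "b", "c", "c", "c"),
  f2   = c(1, 1, 2, 3, 3, 3),
  data = c(10, 11, 12, 13, 14, 15)
)
keys <- c("f1", "f2")

# One row per duplicated key combination; unique() collapses the
# repeated ("c", 3) entries that duplicated() produces for a triple.
dupes <- unique(df[duplicated(df[, keys]), keys, drop = FALSE])
dupes$dup <- 1

# merge() joins on the shared columns (f1, f2); all = TRUE keeps
# the rows of df that have no match in dupes, with dup set to NA.
tagged <- merge(df, dupes, all = TRUE)

# Unmatched rows are not duplicates.
tagged$dup[is.na(tagged$dup)] <- 0
```

With this toy data, nrow(tagged) equals nrow(df). Without unique(), the three "c" rows would come back as six.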
> system.time({
+ fnDupTag = function(dfX, indexVars) {
+ dfDupTag = ddply(dfX, .variables = indexVars, .fun = function(x) {
+ if(nrow(x) > 1) x .... [TRUNCATED]
user system elapsed
118.75 0.22 120.11
> # my more Stata-ish approach
> system.time({
+ dupes <- dfDup[duplicated(dfDup[, 1:4]), 1:4]
+ dupes$dup <- 1
+ dfTemp2 <- merge(dfDup, .... [TRUNCATED]
user system elapsed
0.63 0.00 0.63
With identical results: the only difference all.equal reports is in the attributes (most likely the row names left over from reordering), not in the data.
> # compare
> dfTemp <- dfTemp[with(dfTemp, order(f1, f2, f3, f4, data)), ]
> dfTemp2 <- dfTemp2[with(dfTemp2, order(f1, f2, f3, f4, data)), ]
> all.equal(dfTemp, dfTemp2)
[1] "Attributes: < Component 2: Mean relative difference: 1.529748e-05 >"
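If the attribute message is only a row-name mismatch from the reordering, which is an assumption here since the output does not name the attribute, resetting the row names before comparing should make it go away. A small self-contained sketch with made-up data frames a and b:

```r
# Two hypothetical data frames with the same rows in a different order.
a <- data.frame(x = c(2, 1), y = c("b", "a"))
b <- data.frame(x = c(1, 2), y = c("a", "b"))

# Sorting by subsetting keeps the old row names, so a's become 2, 1.
a <- a[order(a$x), ]

# all.equal() compares attributes too, so this typically reports
# a row-name difference even though the data agree.
before <- all.equal(a, b)

# Reset row names to the default 1..n on both sides, then compare again.
rownames(a) <- NULL
rownames(b) <- NULL
after <- all.equal(a, b)
```

Here after is TRUE, while before holds the attribute message.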