Tag all duplicate rows in R as in Stata

眼角桃花 2021-01-13 02:55

Following up from my question here, I am trying to replicate in R the functionality of the Stata command duplicates tag, which allows me to tag all the rows of

2 answers

夕颜 (OP)

2021-01-13 03:18

I don't really have an answer to your three questions, but I can save you some time. I also split my time between Stata and R and often miss Stata's duplicates commands. If you subset the duplicated keys and then merge them back onto the full data with all=TRUE, the tagging runs much faster.

Here's an example.

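# my more Stata-ish approach
system.time({
    # key columns (1:4) of every repeated row; unique() keeps one row per key
    # so that keys occurring 3+ times do not multiply rows in the merge
    dupes <- unique(dfDup[duplicated(dfDup[, 1:4]), 1:4])
    dupes$dup <- 1
    # merge on the shared key columns; rows without a match get dup = NA
    dfTemp2 <- merge(dfDup, dupes, all=TRUE)
    # turn NA into 0 so dup is a 0/1 tag
    dfTemp2$dup <- ifelse(is.na(dfTemp2$dup), 0, dfTemp2$dup)
})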
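
To make the subset-then-merge idea easy to try, here is a minimal sketch on toy data. The data frame contents and the names keys, tagged and flag are made up for illustration; only the column names f1 to f4 and data follow the thread. It applies unique() to the lookup table so that a key occurring three or more times is listed once; otherwise merge would return extra rows. The last line is a different, merge-free idiom (duplicated() scanned from both ends), included only as a cross-check.

```r
# Toy data (hypothetical): f1..f4 are the key columns, data is a payload.
dfDup <- data.frame(
  f1 = c(1L, 1L, 1L, 2L, 3L),
  f2 = c("a", "a", "a", "b", "c"),
  f3 = c("x", "x", "x", "y", "z"),
  f4 = c(10L, 10L, 10L, 20L, 30L),
  data = c(0.1, 0.2, 0.3, 0.4, 0.5),
  stringsAsFactors = FALSE
)
keys <- c("f1", "f2", "f3", "f4")

# Key combinations that occur more than once, one row per combination.
dupes <- unique(dfDup[duplicated(dfDup[, keys]), keys])
dupes$dup <- 1

# Join the flag back onto every row, then fill unmatched rows with 0.
tagged <- merge(dfDup, dupes, by = keys, all.x = TRUE)
tagged$dup <- ifelse(is.na(tagged$dup), 0, tagged$dup)

# Cross-check without a merge: scanning from both ends flags the first copy too.
flag <- duplicated(dfDup[, keys]) | duplicated(dfDup[, keys], fromLast = TRUE)
```

Here tagged keeps one row per input row, and sum(tagged$dup) should agree with sum(flag). Note that merge may reorder the rows, so compare the two results by key rather than by position.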

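The subset-and-merge approach is far faster than the ddply-based function: about 0.63 seconds elapsed versus about 120 seconds on the same data.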

> system.time({
+ fnDupTag = function(dfX, indexVars) {
+   dfDupTag = ddply(dfX, .variables = indexVars, .fun = function(x) {
+     if(nrow(x) > 1) x .... [TRUNCATED] 
   user  system elapsed 
 118.75    0.22  120.11 

> # my more Stata-ish approach
> system.time({
+     dupes <- dfDup[duplicated(dfDup[, 1:4]), 1:4]
+     dupes$dup <- 1
+     dfTemp2 <- merge(dfDup,  .... [TRUNCATED] 
   user  system elapsed 
   0.63    0.00    0.63

The two results match: after sorting both the same way, all.equal reports only a small difference in an attribute, not in the data columns.

> # compare
> dfTemp <- dfTemp[with(dfTemp, order(f1, f2, f3, f4, data)), ]

> dfTemp2 <- dfTemp2[with(dfTemp2, order(f1, f2, f3, f4, data)), ]
> all.equal(dfTemp, dfTemp2)
[1] "Attributes: < Component 2: Mean relative difference: 1.529748e-05 >"
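
The message above points at an attribute rather than at a data column. The output does not say which attribute it is; assuming it is the row names (which keep their old values when a data frame is reordered), a minimal sketch of a comparison that ignores them looks like this. The frames a and b are hypothetical stand-ins for dfTemp and dfTemp2.

```r
# Two hypothetical frames holding the same rows in a different original order.
a <- data.frame(f1 = c(1, 1, 2), data = c(10, 20, 30))
b <- data.frame(f1 = c(2, 1, 1), data = c(30, 10, 20))

# Sort both the same way, as in the comparison above.
a <- a[order(a$f1, a$data), ]
b <- b[order(b$f1, b$data), ]

# Sorting carries the old row names along, so reset them before comparing.
rownames(a) <- NULL
rownames(b) <- NULL

same <- isTRUE(all.equal(a, b))
```

Wrapping all.equal in isTRUE matters because all.equal returns a character description, not FALSE, when the objects differ.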
