Tag all duplicate rows in R as in Stata

眼角桃花 2021-01-13 02:55

Following up from my question here, I am trying to replicate in R the functionality of the Stata command duplicates tag, which allows me to tag all the rows of

2 answers
  •  夕颜 (original poster)
    2021-01-13 03:18

    I don't really have an answer to your three questions, but I can save you some time. I also split my time between Stata and R and often miss Stata's duplicates commands. If you subset the duplicated keys and then merge them back onto the full data frame with all=TRUE, the tagging runs much faster.

    Here's an example.

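    # my more Stata-ish approach
    system.time({
        # key columns (1:4) of every repeated row; unique() keeps one row per
        # key so the merge cannot multiply rows when a key occurs 3+ times
        dupes <- unique(dfDup[duplicated(dfDup[, 1:4]), 1:4])
        dupes$dup <- 1
        # merge on the shared key columns to flag every row of a duplicated key
        dfTemp2 <- merge(dfDup, dupes, all=TRUE)
        # rows without a match are not duplicates
        dfTemp2$dup <- ifelse(is.na(dfTemp2$dup), 0, dfTemp2$dup)
    })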
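
    The snippet above depends on dfDup, which is not defined in this answer. As a minimal sketch of the same subset-then-merge idea, the following runs on a small made-up data frame. The names dfToy, keyCols and dfTagged are hypothetical and chosen only for illustration. The unique() call guards against row multiplication when a key occurs three or more times, and the last lines cross-check the tag against a merge-free base R idiom.

    ```r
    # Toy data (hypothetical): f1/f2 are the key columns, data is the payload.
    dfToy <- data.frame(
      f1   = c(1, 1, 1, 2, 3, 3),
      f2   = c("a", "a", "a", "b", "c", "c"),
      data = c(10, 11, 12, 20, 30, 31)
    )
    keyCols <- c("f1", "f2")

    # Keys that occur more than once; unique() keeps one row per key so the
    # merge below cannot multiply rows when a key occurs three or more times.
    dupes <- unique(dfToy[duplicated(dfToy[, keyCols]), keyCols])
    dupes$dup <- 1

    # Merge the flag back onto every row, then fill the unmatched rows with 0.
    dfTagged <- merge(dfToy, dupes, by = keyCols, all.x = TRUE)
    dfTagged$dup[is.na(dfTagged$dup)] <- 0

    # Cross-check: a row is tagged if its key repeats in either direction.
    keys  <- dfTagged[, keyCols]
    check <- as.numeric(duplicated(keys) | duplicated(keys, fromLast = TRUE))
    stopifnot(nrow(dfTagged) == nrow(dfToy), all(dfTagged$dup == check))
    ```

    Note that this produces a 0/1 flag. Stata's duplicates tag instead stores the number of surplus copies of each row, so a count-based variant would be needed for an exact replica.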
    

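    The subset-and-merge version is much faster than the ddply-based fnDupTag: about 0.63 seconds elapsed versus about 120 seconds.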

    > system.time({
    + fnDupTag = function(dfX, indexVars) {
    +   dfDupTag = ddply(dfX, .variables = indexVars, .fun = function(x) {
    +     if(nrow(x) > 1) x .... [TRUNCATED] 
       user  system elapsed 
     118.75    0.22  120.11 
    
    > # my more Stata-ish approach
    > system.time({
    +     dupes <- dfDup[duplicated(dfDup[, 1:4]), 1:4]
    +     dupes$dup <- 1
    +     dfTemp2 <- merge(dfDup,  .... [TRUNCATED] 
       user  system elapsed 
       0.63    0.00    0.63 
    

    The two approaches give the same data. The only difference all.equal reports is in an attribute of the data frames, not in any column.

    > # compare
    > dfTemp <- dfTemp[with(dfTemp, order(f1, f2, f3, f4, data)), ]
    
    > dfTemp2 <- dfTemp2[with(dfTemp2, order(f1, f2, f3, f4, data)), ]
    > all.equal(dfTemp, dfTemp2)
    [1] "Attributes: < Component 2: Mean relative difference: 1.529748e-05 >"
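
    An "Attributes" message like the one above most likely comes from the row names, which keep their pre-sort values after reordering, although that is an assumption about this particular output. A minimal sketch with two made-up data frames (dfA and dfB are hypothetical names) shows the effect and one way to compare only the data:

    ```r
    # Two hypothetical data frames with the same rows in a different order.
    dfA <- data.frame(f1 = c(2, 1, 3), data = c(20, 10, 30))
    dfB <- data.frame(f1 = c(1, 2, 3), data = c(10, 20, 30))

    # Sorting reorders the rows, but dfA keeps its old row names (2, 1, 3).
    dfA <- dfA[order(dfA$f1), ]
    dfB <- dfB[order(dfB$f1), ]
    before <- isTRUE(all.equal(dfA, dfB))   # not TRUE: the row names differ

    # Reset the row names so that only the data is compared.
    rownames(dfA) <- NULL
    rownames(dfB) <- NULL
    after <- isTRUE(all.equal(dfA, dfB))

    stopifnot(!before, after)
    ```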
    
