Tag all duplicate rows in R as in Stata

眼角桃花 2021-01-13 02:55

Following up from my question here, I am trying to replicate in R the functionality of the Stata command duplicates tag, which allows me to tag all the rows of

2 answers
  • 2021-01-13 03:18

    I don't really have an answer to your three questions, but I can save you some time. I also split my time between Stata and R and often miss Stata's duplicates commands. If you subset the duplicated rows and then merge them back with all=TRUE, you can save a lot of time.

    Here's an example.

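    # my more Stata-ish approach
    system.time({
        # key columns (1:4) of rows that repeat an earlier row; unique() keeps one
        # row per key so that merge() does not multiply groups of three or more
        dupes <- unique(dfDup[duplicated(dfDup[, 1:4]), 1:4])
        dupes$dup <- 1
        # merge on the shared key columns; keys that never repeat get dup = NA
        dfTemp2 <- merge(dfDup, dupes, all=TRUE)
        # rows without a match are not duplicated, so tag them 0
        dfTemp2$dup <- ifelse(is.na(dfTemp2$dup), 0, dfTemp2$dup)
    })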
    

    This is quite a bit faster: about 0.63 seconds elapsed versus 120.11 seconds for the ddply version.

    > system.time({
    + fnDupTag = function(dfX, indexVars) {
    +   dfDupTag = ddply(dfX, .variables = indexVars, .fun = function(x) {
    +     if(nrow(x) > 1) x .... [TRUNCATED] 
       user  system elapsed 
     118.75    0.22  120.11 
    
    > # my more Stata-ish approach
    > system.time({
    +     dupes <- dfDup[duplicated(dfDup[, 1:4]), 1:4]
    +     dupes$dup <- 1
    +     dfTemp2 <- merge(dfDup,  .... [TRUNCATED] 
       user  system elapsed 
       0.63    0.00    0.63 
    

    The results are the same (subject to all.equal's precision), apart from one attribute difference that all.equal reports:

    > # compare
    > dfTemp <- dfTemp[with(dfTemp, order(f1, f2, f3, f4, data)), ]
    
    > dfTemp2 <- dfTemp2[with(dfTemp2, order(f1, f2, f3, f4, data)), ]
    > all.equal(dfTemp, dfTemp2)
    [1] "Attributes: < Component 2: Mean relative difference: 1.529748e-05 >"
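
    The subset-then-merge idea above can be tried on a small made-up data frame. This is only a sketch: dfToy, keyCols and the column names are hypothetical, and unique() is used so that a key appearing three or more times is not multiplied by the merge. The last lines cross-check the tag against base R's duplicated() scanned from both ends.

    ```r
    # Toy data: hypothetical key columns f1, f2 and a payload column `data`
    dfToy <- data.frame(
      f1   = c("a", "a", "a", "b", "c", "c"),
      f2   = c(1, 1, 1, 2, 3, 3),
      data = c(10, 20, 30, 40, 50, 60)
    )
    keyCols <- c("f1", "f2")

    # One row per duplicated key, flagged with dup = 1
    dupKeys <- unique(dfToy[duplicated(dfToy[, keyCols]), keyCols])
    dupKeys$dup <- 1

    # Merge the flags back; keys that never repeat come back as NA, then become 0
    tagged <- merge(dfToy, dupKeys, all.x = TRUE)
    tagged$dup <- ifelse(is.na(tagged$dup), 0, tagged$dup)

    # Cross-check: scanning from both ends flags every member of a repeated group
    check <- duplicated(tagged[, keyCols]) |
      duplicated(tagged[, keyCols], fromLast = TRUE)
    all(check == (tagged$dup == 1))
    ```

    Here all.x = TRUE is enough because every key in dupKeys also occurs in dfToy; all = TRUE, as in the answer, gives the same rows.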
    
  • 2021-01-13 03:31

    I'll answer your third question here. (I think the first question is more or less answered in your other post.)

    ## Assuming DT is your data.table
    DT[, dupvar := 1L*(.N > 1L), by=c(indexVars)]
    

    := adds a new column dupvar by reference (and is therefore very fast, because no copies are made). .N is a special variable within data.table that gives the number of observations in each group (here, each combination of f1, f2, f3, f4).

    Take your time and go through ?data.table (and run the examples there) to understand the usage. It'll save you a lot of time later on.

    So, basically, we group by indexVars and check whether .N > 1L, which is TRUE for every group with more than one row. We multiply by 1L to return an integer instead of a logical value.
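
    The grouping step described above can be sketched on a small made-up table. The data and the names DT, indexVars and ndup are assumptions for illustration only. The second assignment is a variant, not part of the answer: as far as I know, Stata's duplicates tag stores the number of other copies rather than a 0/1 flag, which corresponds to .N - 1L.

    ```r
    library(data.table)

    # Toy data: hypothetical key columns f1, f2 and a payload column `data`
    DT <- data.table(
      f1   = c("a", "a", "a", "b", "c", "c"),
      f2   = c(1, 1, 1, 2, 3, 3),
      data = c(10, 20, 30, 40, 50, 60)
    )
    indexVars <- c("f1", "f2")

    # .N is the group size, so .N > 1L is TRUE for every row of a repeated key;
    # multiplying by 1L turns the logical into an integer 0/1 tag
    DT[, dupvar := 1L * (.N > 1L), by = indexVars]

    # Variant (assumption): count the other rows that share the same key
    DT[, ndup := .N - 1L, by = indexVars]

    print(DT)
    ```

    Both columns are added in place and the row order of DT is left unchanged.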

    If required, you can also sort DT by the by-columns using setkey.
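
    A minimal sketch of that sorting step, on made-up data. It uses setkeyv, the variant of setkey that takes the column names as a character vector, which is convenient when the by-columns are already stored in indexVars.

    ```r
    library(data.table)

    # Toy data in unsorted order; column names are hypothetical
    DT <- data.table(
      f1   = c("c", "a", "b", "a"),
      f2   = c(3, 1, 2, 1),
      data = c(50, 10, 40, 20)
    )
    indexVars <- c("f1", "f2")

    # setkeyv() sorts DT in place (ascending) and records the columns as its key
    setkeyv(DT, indexVars)

    key(DT)   # the key columns now recorded on DT
    DT$f1     # rows are now ordered by f1, then f2
    ```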


    From the next version on (currently implemented in v1.9.3, the development version), there's also an exported function setorder that just sorts the data.table by reference, without setting keys. It can sort in ascending or descending order. (Note that setkey always sorts in ascending order.)

    That is, in the next version you can do:

    setorder(DT, f1, f2, f3, f4)
    ## or equivalently
    setorderv(DT, c("f1", "f2", "f3", "f4"))
    

    In addition, the usage DT[order(...)] is also optimised internally to use data.table's fast ordering. That is, DT[order(...)] is detected internally and changed to DT[forder(DT, ...)], which is much faster than base R's order. So, if you don't want to change DT by reference and instead want to assign the sorted data.table to another variable, you can just do:

    DT_sorted <- DT[order(f1, f2, f3, f4)] ## internally optimised for speed
                                           ## but still copies!
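
    The difference between the two styles can be seen on a tiny made-up table. This sketch assumes a data.table version that exports setorder; the data and names are hypothetical.

    ```r
    library(data.table)

    DT <- data.table(f1 = c("b", "a", "c"), data = c(2, 1, 3))

    # DT[order(...)] returns a sorted copy and leaves DT itself untouched
    DT_sorted <- DT[order(f1)]
    DT$f1          # still in the original order

    # setorder() reorders DT itself; a minus sign requests descending order
    setorder(DT, -f1)
    DT$f1          # now sorted in descending order of f1
    key(DT)        # NULL, since setorder does not set a key
    ```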
    

    HTH
