Sort a data.table fast by Ascending/Descending order

说谎 2020-11-30 21:38

I have a data.table with about 3 million rows and 40 columns. I would like to sort this table in descending order within groups, as in the following SQL mock code:

2 Answers
  • 2020-11-30 21:44

    The comment was mine, so I'll post the answer. I removed it because I couldn't test whether it was equivalent to what you already had. Glad to hear it's faster.

    X <- X[order(Year, MemberID, -Month)]
    

    Summarizing shouldn't depend on the order of your rows.
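
    As a small illustration of the call above, here is a sketch on a made-up six-row table. Only the column names Year, MemberID and Month come from the question; the values and the Paid column are invented for the example. A minus sign in front of a column requests descending order on that column, and the grouped totals at the end come out the same whether or not the rows were sorted first.

    ```r
    library(data.table)

    # Toy data; the values and the Paid column are hypothetical.
    X <- data.table(
      Year     = c(2001L, 2000L, 2001L, 2000L, 2000L, 2001L),
      MemberID = c("b", "a", "b", "a", "b", "b"),
      Month    = c(3L, 1L, 7L, 12L, 5L, 1L),
      Paid     = c(10, 20, 30, 40, 50, 60)
    )

    # Ascending Year and MemberID, descending Month within each group.
    sorted <- X[order(Year, MemberID, -Month)]
    print(sorted)

    # The same grouped summary on the unsorted and the sorted table.
    # keyby orders the groups, so the two results are directly comparable.
    s1 <- X[, .(Total = sum(Paid)), keyby = .(Year, MemberID)]
    s2 <- sorted[, .(Total = sum(Paid)), keyby = .(Year, MemberID)]
    print(all.equal(s1, s2))
    ```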

  • 2020-11-30 21:51

    Update June 5 2014:

    The current development version of data.table, v1.9.3, implements two new functions, setorder and setorderv, which do exactly what you require. These functions reorder the data.table by reference, with the option to choose either ascending or descending order for each column to order by. Check out ?setorder for more info.

    In addition, DT[order(.)] is by default optimised to use data.table's internal fast order instead of base:::order. Unlike setorder, this makes an entire copy of the data and is therefore less memory efficient, but it is still orders of magnitude faster than base's order.
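
    To make the difference between the two approaches concrete, here is a minimal sketch on a three-row toy table (the table and its column names are made up for illustration). It only shows the semantics, not the speed: DT[order(...)] returns a sorted copy and leaves DT alone, whereas setorder() reorders DT itself.

    ```r
    library(data.table)

    # Hypothetical toy table.
    DT <- data.table(id = c(3L, 1L, 2L), val = c("c", "a", "b"))

    # DT[order(...)] returns a new, sorted table; DT keeps its row order.
    sorted_copy <- DT[order(-id)]
    unchanged <- identical(DT$id, c(3L, 1L, 2L))
    print(unchanged)

    # setorder() reorders DT in place and returns it invisibly,
    # so no assignment is needed.
    setorder(DT, -id)
    print(DT)
    ```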

    Benchmarks:

    Here's an illustration of the speed differences between setorder, data.table's internal fast order, and base:::order:

    require(data.table) ## 1.9.3
    set.seed(1L)
    DT <- data.table(Year     = sample(1950:2000, 3e6, TRUE), 
                     memberID = sample(paste0("V", 1:1e4), 3e6, TRUE), 
                     month    = sample(12, 3e6, TRUE))
    
    ## using base:::order
    system.time(ans1 <- DT[base:::order(Year, memberID, -month)])
    #   user  system elapsed 
    # 76.909   0.262  81.266 
    
    ## optimised to use data.table's fast order
    system.time(ans2 <- DT[order(Year, memberID, -month)])
    #   user  system elapsed 
    #  0.985   0.030   1.027
    
    ## reorders by reference
    system.time(setorder(DT, Year, memberID, -month))
    #   user  system elapsed 
    #  0.585   0.013   0.600 
    
    ## or alternatively
    ## setorderv(DT, c("Year", "memberID", "month"), c(1,1,-1))
    
    ## are they equal?
    identical(ans2, DT)    # [1] TRUE
    identical(ans1, ans2)  # [1] TRUE
    

    On this data, the benchmarks indicate that data.table's order is about 79x faster than base:::order, and setorder is about 135x faster than base:::order.
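
    The setorderv alternative mentioned in the benchmark comments is convenient when the sort columns are only known as strings. Here is a small sketch, assuming a made-up four-row table with the same column names as the benchmark; sort_cols and sort_dirs are hypothetical variable names. It checks that setorderv with a direction vector of 1 and -1 gives the same row order as the equivalent setorder call.

    ```r
    library(data.table)

    # Toy data with the benchmark's column names; values are invented.
    DT <- data.table(Year     = c(2000L, 1999L, 2000L, 1999L),
                     memberID = c("V2", "V1", "V2", "V1"),
                     month    = c(1L, 6L, 9L, 2L))

    sort_cols <- c("Year", "memberID", "month")  # columns to sort by
    sort_dirs <- c(1L, 1L, -1L)                  # 1 = ascending, -1 = descending

    # Work on copies, because both functions reorder by reference.
    A <- copy(DT)
    setorder(A, Year, memberID, -month)

    B <- copy(DT)
    setorderv(B, cols = sort_cols, order = sort_dirs)

    print(all.equal(A, B))
    ```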

    data.table always sorts/orders in the C-locale. Only if you need to order in another locale do you have to fall back on DT[base:::order(.)].
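
    What C-locale ordering means in practice is easiest to see with mixed-case strings. The sketch below uses a made-up one-column table: in the C-locale, uppercase ASCII letters sort before lowercase ones. For the locale-aware variant it computes the permutation with base::order outside of `[` and then subsets by that index, which is one way (an assumption of this sketch, not the only option) to make sure base's ordering is what gets used; its result depends on the session's collation settings, so no output is shown for it.

    ```r
    library(data.table)

    # Hypothetical toy table.
    DT <- data.table(x = c("b", "A", "a", "B"))

    # data.table ordering: C-locale, uppercase before lowercase.
    C <- copy(DT)
    setorder(C, x)
    c_sorted <- C$x
    print(c_sorted)   # "A" "B" "a" "b"

    # Locale-aware ordering: build the index with base::order first,
    # then subset. The result depends on the session's collation locale.
    idx <- base::order(DT$x)
    loc_sorted <- DT[idx]$x
    print(loc_sorted)
    ```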

    All these new optimisations and functions together constitute FR #2405. Support for bit64::integer64 has also been added.


    NOTE: Please refer to the history/revisions for earlier answer and updates.
