fast frequency and percentage table with dplyr

一个人的身影 2020-12-30 15:40

I have been using a small tab function for some time, which shows the frequency, percent, and cumulative percent for a vector. The output looks like this:

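The question's tab function and its printed output did not survive in this excerpt. As an illustration only, a dplyr version matching the description (frequency, percent of non-NA values, cumulative percent) might look like the following; the name `tab2` and the column names are assumptions, not the asker's original code:

```r
library(dplyr)

# Hypothetical reconstruction of a dplyr-based tab function:
# one row per distinct value (NA included), with frequency,
# percent of non-NA values, and cumulative percent.
tab2 <- function(x) {
  n <- sum(!is.na(x))                      # denominator excludes NAs
  out <- tibble(x = x) %>%
    count(x, name = "Freq") %>%            # frequency per distinct value
    mutate(Percent = Freq / n)
  # cumulative percent over the non-NA rows only; the NA row stays NA
  out$CumSum <- NA_real_
  out$CumSum[!is.na(out$x)] <- cumsum(out$Percent[!is.na(out$x)])
  out
}
```

For example, `tab2(c("a", "a", "b", NA))` yields one row each for `"a"`, `"b"`, and `NA`, with `Percent` computed against the 3 non-NA values.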
1 Answer
  • 2020-12-30 15:55

As I'm a big fan of library(data.table), I wrote a similar function:

    library(data.table)

    tabdt <- function(x){
        n <- sum(!is.na(x))  # denominator: count of non-NA values
        dt <- data.table(x)
        # one row per distinct value of x (NA included), with count and share
        out <- dt[, list(Freq = .N, Percent = .N / n), by = x]
        # cumulative percent over the non-NA rows only; the NA row stays NA
        out[!is.na(x), CumSum := cumsum(Percent)]
        out
    }
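A quick sanity check of `tabdt` on a small vector (the sample values here are made up for illustration):

```r
library(data.table)

# tabdt as defined above; try it on a vector containing an NA.
x <- c("b", "a", "a", NA, "b", "b")
tabdt(x)
# Expect one row per distinct value (including NA): Freq counts
# occurrences, Percent divides by the 5 non-NA values, and CumSum
# accumulates Percent over the non-NA rows only (the NA row stays NA).
```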
    
    > benchmark(tabdt(x1), tab2(x1), replications=1000)[,c('test','elapsed','relative')]
           test elapsed relative
    2  tab2(x1)    5.60    1.879
    1 tabdt(x1)    2.98    1.000
    > benchmark(tabdt(x2), tab2(x2), replications=1000)[,c('test','elapsed','relative')]
           test elapsed relative
    2  tab2(x2)    6.34    1.686
    1 tabdt(x2)    3.76    1.000
    > benchmark(tabdt(x3), tab2(x3), replications=1000)[,c('test','elapsed','relative')]
           test elapsed relative
    2  tab2(x3)    1.65    1.000
    1 tabdt(x3)    2.34    1.418
    > benchmark(tabdt(x4), tab2(x4), replications=1000)[,c('test','elapsed','relative')]
           test elapsed relative
    2  tab2(x4)   14.35    1.000
    1 tabdt(x4)   22.04    1.536
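The timings above come from the rbenchmark package's `benchmark()`. The test vectors x1 through x4 are not defined anywhere in this excerpt, so the stand-in vector below is purely an assumption, used only to show the shape of the benchmarking call:

```r
library(data.table)
library(rbenchmark)

# Stand-in test vector (illustrative only; the thread's x1..x4 are not shown):
# a moderately sized character vector with repeated values and some NAs.
x <- sample(c(letters[1:4], NA), 1e4, replace = TRUE)

# tabdt as defined above; report elapsed and relative times.
benchmark(tabdt(x), replications = 100)[, c("test", "elapsed", "relative")]
```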
    

    And so the data.table approach was faster for x1 and x2, while dplyr was faster for x3 and x4. I don't see much room for improvement in either approach.

    P.S. Would you add the data.table tag to this question? I believe people would love to see a dplyr vs. data.table performance comparison (see data.table vs dplyr: can one do something well the other can't or does poorly? for example).
