I have been using a small tab function for some time, which shows the frequency, percent, and cumulative percent for a vector. As I'm a big fan of library(data.table), I wrote a similar function:
library(data.table)

tabdt <- function(x){
  n <- length(which(!is.na(x)))                     # number of non-missing values
  dt <- data.table(x)
  out <- dt[, list(Freq = .N, Percent = .N / n), by = x]
  out[!is.na(x), CumSum := cumsum(Percent)]         # cumulative percent over non-NA rows only
  out
}
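For context, tab2 in the benchmarks below is the dplyr-based version from earlier in the question and isn't repeated here; a rough sketch of a dplyr function producing the same columns might look like the following (tab2_sketch and its internals are an assumed stand-in, not the original tab2, and row ordering may differ from tabdt):

library(dplyr)

tab2_sketch <- function(x){
  n_valid <- sum(!is.na(x))                                   # non-missing count
  data.frame(x = x) %>%
    group_by(x) %>%
    summarise(Freq = n(), Percent = Freq / n_valid) %>%       # per-value counts and shares
    ungroup() %>%
    mutate(CumSum = ifelse(is.na(x), NA,                      # keep CumSum NA for the NA row,
                           cumsum(ifelse(is.na(x), 0, Percent))))  # mirroring tabdt above
}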
> benchmark(tabdt(x1), tab2(x1), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x1)    5.60    1.879
1 tabdt(x1)    2.98    1.000
> benchmark(tabdt(x2), tab2(x2), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x2)    6.34    1.686
1 tabdt(x2)    3.76    1.000
> benchmark(tabdt(x3), tab2(x3), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x3)    1.65    1.000
1 tabdt(x3)    2.34    1.418
> benchmark(tabdt(x4), tab2(x4), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x4)   14.35    1.000
1 tabdt(x4)   22.04    1.536
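The x1 through x4 vectors are built earlier in the question and not repeated here; for reference, a comparable run could be set up along these lines, where y is a stand-in test vector with some NAs rather than one of the original inputs:

library(rbenchmark)
set.seed(42)
y <- c(sample(letters, 1e5, replace = TRUE), rep(NA, 100))   # stand-in vector with missing values
benchmark(tabdt(y), tab2(y), replications = 1000)[, c("test", "elapsed", "relative")]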
And so the data.table approach was faster for x1 and x2, while dplyr was faster for x3 and x4. Actually, I don't see any room for improvement using these approaches.
P.S. Would you add the data.table tag to this question? I believe people would love to see a dplyr vs. data.table performance comparison (see data.table vs dplyr: can one do something well the other can't or does poorly? for example).