I have been using a small tab function for some time, which shows the frequency, percent, and cumulative percent for a vector. As I'm a big fan of library(data.table), I wrote a similar function:
library(data.table)

tabdt <- function(x){
  n <- length(which(!is.na(x)))                     # number of non-missing values
  dt <- data.table(x)
  out <- dt[, list(Freq = .N, Percent = .N / n), by = x]
  out[!is.na(x), CumSum := cumsum(Percent)]         # cumulative percent over non-NA rows only
  out
}
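For context, tab2 in the benchmarks below is the dplyr-based version from earlier in the question and isn't repeated here; a rough sketch of a dplyr function producing the same columns might look like the following (tab2_sketch and its internals are an assumed stand-in, not the original tab2, and row ordering may differ from tabdt):

library(dplyr)

tab2_sketch <- function(x){
  n_valid <- sum(!is.na(x))                                   # non-missing count
  data.frame(x = x) %>%
    group_by(x) %>%
    summarise(Freq = n(), Percent = Freq / n_valid) %>%       # per-value counts and shares
    ungroup() %>%
    mutate(CumSum = ifelse(is.na(x), NA,                      # keep CumSum NA for the NA row,
                           cumsum(ifelse(is.na(x), 0, Percent))))  # mirroring tabdt above
}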
> benchmark(tabdt(x1), tab2(x1), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x1)    5.60    1.879
1 tabdt(x1)    2.98    1.000
> benchmark(tabdt(x2), tab2(x2), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x2)    6.34    1.686
1 tabdt(x2)    3.76    1.000
> benchmark(tabdt(x3), tab2(x3), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x3)    1.65    1.000
1 tabdt(x3)    2.34    1.418
> benchmark(tabdt(x4), tab2(x4), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x4)   14.35    1.000
1 tabdt(x4)   22.04    1.536
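The x1 through x4 vectors are built earlier in the question and not repeated here; for reference, a comparable run could be set up along these lines, where y is a stand-in test vector with some NAs rather than one of the original inputs:

library(rbenchmark)
set.seed(42)
y <- c(sample(letters, 1e5, replace = TRUE), rep(NA, 100))   # stand-in vector with missing values
benchmark(tabdt(y), tab2(y), replications = 1000)[, c("test", "elapsed", "relative")]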
And so the data.table approach was faster for x1 and x2, while dplyr was faster for x3 and x4. Actually, I don't see any room for improvement using these approaches.
P.S. Would you add the data.table tag to this question? I believe people would love to see a dplyr vs. data.table performance comparison (see data.table vs dplyr: can one do something well the other can't or does poorly? for example).