With data.table syntax, the one-step approach is to create the big data.table first and then do the aggregation:
library(data.table)
# combine the list of per-policy data.frames into one big data.table
OneBigDataFrame <- rbindlist(ListOfDataFrames)
# lapply(.SD, sum) sums every column except the grouping column ProjM
OneBigDataFrame[, lapply(.SD, sum), by = ProjM]
or, more concisely, in one line:
rbindlist(ListOfDataFrames)[, lapply(.SD, sum), by = ProjM]
   ProjM Cashflow1 Cashflow2
1:     1        55       -35
2:     2        55       -35
3:     3        55       -35
4:     4         5        10
5:     5         5        10
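As a side note, not part of the original question: if a guaranteed ordering of the result by ProjM is wanted, keyby = ProjM can be used in place of by = ProjM; it sorts the result and sets ProjM as its key:

rbindlist(ListOfDataFrames)[, lapply(.SD, sum), keyby = ProjM]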
Now, the OP has requested to avoid creating the big data.table first in order to save memory. This requires a two-step approach in which the aggregates are computed for each data.table and then combined into a grand total in a final step:
rbindlist(
  lapply(ListOfDataFrames,
         function(x) setDT(x)[, lapply(.SD, sum), by = ProjM])
)[, lapply(.SD, sum), by = ProjM]
   ProjM Cashflow1 Cashflow2
1:     1        55       -35
2:     2        55       -35
3:     3        55       -35
4:     4         5        10
5:     5         5        10
Note that setDT() is used here to coerce the data.frames to data.table by reference, i.e., without creating an additional copy, which saves time and memory.
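A minimal sketch of that by-reference behaviour (df here is a throwaway example object, not part of the valuation code):

df <- data.frame(a = 1:3)
setDT(df)    # converts df in place; no copy is made
class(df)    # "data.table" "data.frame"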
Using the benchmark data of d.b (a list of 10000 data.frames with 100 rows each, 28.5 Mb in total), all answers provided so far are compared.
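d.b's data generation code is not repeated here, but a list of the same shape can be created along these lines (a sketch; the exact column values are an assumption):

test <- lapply(1:10000, function(i)
  data.frame(ProjM = 1:100, Cashflow1 = rnorm(100), Cashflow2 = rnorm(100)))

The benchmark itself: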
mb <- microbenchmark::microbenchmark(
malan = {
OneBigDataFrame <- rbindlist(test)
malan <- aggregate(. ~ ProjM, data = OneBigDataFrame, FUN = sum)
},
d.b = d.b <- with(data = data.frame(do.call(dplyr::bind_rows, test)),
expr = aggregate(x = list(Cashflow1 = Cashflow1, Cashflow2 = Cashflow2),
by = list(ProjM = ProjM),
FUN = sum)),
a.gore = {
newagg <- function(dataset) {
dataset <- data.table(dataset)
      dataset <- dataset[, lapply(.SD, sum), by = ProjM,
                         .SDcols = c("Cashflow1", "Cashflow2")]
return(dataset)
}
a.gore <- newagg(rbindlist(lapply(test,newagg)))
},
dt1 = dt1 <- rbindlist(test)[, lapply(.SD, sum), by = ProjM],
dt2 = dt2 <- rbindlist(
lapply(test,
function(x) setDT(x)[, lapply(.SD, sum), by = ProjM])
)[, lapply(.SD, sum), by = ProjM],
times = 5L
)
mb
Unit: milliseconds
   expr         min          lq        mean      median          uq        max neval cld
  malan   565.43967   583.08300   631.15898   600.45790   605.60237   801.2120     5  b
    d.b   707.50261   710.31127   719.25591   713.54526   721.26691   743.6535     5  b
 a.gore 14706.40442 14747.76305 14861.61641 14778.88547 14805.29412 15269.7350     5   d
    dt1    40.10061    40.92474    42.27034    41.55434    42.07951    46.6925     5 a
    dt2  8806.85039  8846.47519  9144.00399  9295.29432  9319.17251  9452.2275     5  c
The fastest solution is the one-step approach using data.table (dt1), which is about 15 times faster than the second fastest (malan). Surprisingly, the two-step data.table approaches (a.gore and dt2) are orders of magnitude slower than the one-step approach, presumably because the overhead of calling [.data.table once per small data.frame dominates the actual aggregation work.
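That presumed overhead can be made plausible with a quick sketch (timings are machine-dependent and assumed, not taken from the benchmark above):

small <- data.table(ProjM = 1:100, Cashflow1 = 1, Cashflow2 = 1)
microbenchmark::microbenchmark(small[, lapply(.SD, sum), by = ProjM])
# even well under a millisecond per call adds up to seconds across 10000 tables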
To make sure that all solutions return the same result, this can be checked with:
all.equal(malan, d.b)
all.equal(malan, as.data.frame(a.gore))
all.equal(malan, as.data.frame(dt1))
all.equal(malan, as.data.frame(dt2))
which return TRUE in all cases.