Aggregate variables in list of data frames into single data frame

asked by 野趣味, 2021-01-07 07:03

I am performing a per-policy life insurance valuation in R. Monthly cash flow projections are performed per policy and return a data frame in the following format (for example: …)

2 Answers
  • 2021-01-07 07:18

    I think this solution might be efficient. Give it a try and let me know.

    require(data.table)
    newagg <- function(dataset) {
      dataset <- data.table(dataset)
      dataset[, lapply(.SD, sum), by = ProjM, .SDcols = c("Cashflow1", "Cashflow2")]
    }
    newagg(rbindlist(lapply(ListOfDataFrames,newagg)))
    # ProjM Cashflow1 Cashflow2
    # 1:     1        55       -35
    # 2:     2        55       -35
    # 3:     3        55       -35
    # 4:     4         5        10
    # 5:     5         5        10
    
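    A self-contained run of this approach, using a hypothetical ListOfDataFrames (the sample values below are assumptions, chosen to reproduce the output shown above):

    ```r
    library(data.table)

    # Hypothetical sample input: two per-policy projections sharing the
    # ProjM / Cashflow1 / Cashflow2 layout from the question.
    ListOfDataFrames <- list(
      data.frame(ProjM = 1:3, Cashflow1 = c(10, 10, 10), Cashflow2 = c(-5, -5, -5)),
      data.frame(ProjM = 1:5, Cashflow1 = c(45, 45, 45, 5, 5),
                 Cashflow2 = c(-30, -30, -30, 10, 10))
    )

    # Aggregate one data frame by projection month, summing the cash flow columns.
    newagg <- function(dataset) {
      dataset <- data.table(dataset)
      dataset[, lapply(.SD, sum), by = ProjM, .SDcols = c("Cashflow1", "Cashflow2")]
    }

    # Aggregate each element, stack the per-policy aggregates, then aggregate again.
    result <- newagg(rbindlist(lapply(ListOfDataFrames, newagg)))
    result
    #    ProjM Cashflow1 Cashflow2
    # 1:     1        55       -35
    # 2:     2        55       -35
    # 3:     3        55       -35
    # 4:     4         5        10
    # 5:     5         5        10
    ```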
  • 2021-01-07 07:32

    With data.table syntax, the one-step approach is to create the big data.table first and then aggregate:

    library(data.table)
    OneBigDataFrame <- rbindlist(ListOfDataFrames)
    OneBigDataFrame[, lapply(.SD, sum), by = ProjM]
    

    or, more concisely:

    rbindlist(ListOfDataFrames)[, lapply(.SD, sum), by = ProjM]
    
       ProjM Cashflow1 Cashflow2
    1:     1        55       -35
    2:     2        55       -35
    3:     3        55       -35
    4:     4         5        10
    5:     5         5        10
    

    Now, the OP has asked to avoid creating the big data.table first in order to save memory. This requires a two-step approach: the aggregates are computed for each data.table separately and are then aggregated to a grand total in a final step:

    rbindlist(
      lapply(ListOfDataFrames, 
             function(x) setDT(x)[, lapply(.SD, sum), by = ProjM])
      )[, lapply(.SD, sum), by = ProjM]
    
       ProjM Cashflow1 Cashflow2
    1:     1        55       -35
    2:     2        55       -35
    3:     3        55       -35
    4:     4         5        10
    5:     5         5        10
    

    Note that setDT() is used here to coerce the data.frames to data.table by reference, i.e., without creating an additional copy, which saves time and memory.
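    The by-reference behaviour can be seen directly (a small illustration with made-up data, not part of the original answer):

    ```r
    library(data.table)

    # setDT() coerces in place: the original object itself becomes a
    # data.table, and no copy is created.
    df1 <- data.frame(ProjM = 1:3, Cashflow1 = c(10, 20, 30))
    setDT(df1)
    class(df1)   # "data.table" "data.frame"

    # as.data.table(), by contrast, returns a copy and leaves the input untouched.
    df2 <- data.frame(ProjM = 1:3, Cashflow1 = c(10, 20, 30))
    dt2 <- as.data.table(df2)
    class(df2)   # still just "data.frame"
    ```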

    Benchmark

    Using the benchmark data of d.b (a list of 10000 data.frames with 100 rows each, 28.5 Mb in total) with all answers provided so far:
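    The benchmark input is described but not shown; a sketch that builds a list of the same shape might look like this (the column values are an assumption, only the dimensions match d.b's description):

    ```r
    # Hypothetical reconstruction of the benchmark input: a list of 10000
    # data.frames with 100 rows each and the ProjM / Cashflow1 / Cashflow2 layout.
    set.seed(1)
    test <- lapply(1:10000, function(i) {
      data.frame(ProjM     = 1:100,
                 Cashflow1 = rnorm(100),
                 Cashflow2 = rnorm(100))
    })
    length(test)                             # 10000
    format(object.size(test), units = "Mb")  # total size of the list
    ```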

    mb <- microbenchmark::microbenchmark(
      malan = {
        OneBigDataFrame <- rbindlist(test)
        malan <- aggregate(. ~ ProjM, data = OneBigDataFrame, FUN = sum)
      },
      d.b = d.b <- with(data = data.frame(do.call(dplyr::bind_rows, test)),
                 expr = aggregate(x = list(Cashflow1 = Cashflow1, Cashflow2 = Cashflow2),
                                  by = list(ProjM = ProjM),
                                  FUN = sum)),
      a.gore = {
        newagg <- function(dataset) { 
          dataset <- data.table(dataset)
          dataset <- dataset[,lapply(.SD,sum),by=ProjM,.SDcols=c("Cashflow1","Cashflow2")]
          return(dataset)
        }
        a.gore <- newagg(rbindlist(lapply(test,newagg)))
      },
      dt1 = dt1 <- rbindlist(test)[, lapply(.SD, sum), by = ProjM],
      dt2 = dt2 <- rbindlist(
        lapply(test, 
               function(x) setDT(x)[, lapply(.SD, sum), by = ProjM])
      )[, lapply(.SD, sum), by = ProjM],
      times = 5L
    )
    mb
    
    Unit: milliseconds
       expr         min          lq        mean      median          uq        max neval  cld
      malan   565.43967   583.08300   631.15898   600.45790   605.60237   801.2120     5  b  
        d.b   707.50261   710.31127   719.25591   713.54526   721.26691   743.6535     5  b  
     a.gore 14706.40442 14747.76305 14861.61641 14778.88547 14805.29412 15269.7350     5    d
        dt1    40.10061    40.92474    42.27034    41.55434    42.07951    46.6925     5 a   
        dt2  8806.85039  8846.47519  9144.00399  9295.29432  9319.17251  9452.2275     5   c
    

    The fastest solution is the one-step data.table approach, which is about 15 times faster than the second fastest. Surprisingly, the two-step data.table approach is orders of magnitude slower than the one-step approach.

    To make sure that all solutions return the same result, check with:

    all.equal(malan, d.b)
    all.equal(malan, as.data.frame(a.gore))
    all.equal(malan, as.data.frame(dt1))
    all.equal(malan, as.data.frame(dt2))
    

    which return TRUE in all cases.
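    These checks can be reproduced end to end on a small hypothetical input (the sample values below are assumptions, chosen to match the aggregated output shown above):

    ```r
    library(data.table)

    # Hypothetical small input with the ProjM / Cashflow1 / Cashflow2 layout.
    test <- list(
      data.frame(ProjM = 1:3, Cashflow1 = c(10, 10, 10), Cashflow2 = c(-5, -5, -5)),
      data.frame(ProjM = 1:5, Cashflow1 = c(45, 45, 45, 5, 5),
                 Cashflow2 = c(-30, -30, -30, 10, 10))
    )

    # Base-R aggregation vs. the one-step data.table aggregation.
    malan <- aggregate(. ~ ProjM, data = as.data.frame(rbindlist(test)), FUN = sum)
    dt1   <- rbindlist(test)[, lapply(.SD, sum), by = ProjM]

    # Both routes produce the same data.frame.
    isTRUE(all.equal(malan, as.data.frame(dt1)))
    ```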
