Memory issue with foreach loop in R on Windows 8 (64-bit) (doParallel package)

夕颜 2021-02-05 22:32

I'm trying to move from a serial to parallel approach to accomplish some multivariate time series analysis tasks on a large data.table. The table contains data fo

3 answers
  •  广开言路
    2021-02-05 23:28

    The answer requires the iterators package and its isplit function. Like split, isplit breaks the main data object into chunks based on one or more factor columns, but it returns an iterator rather than a list. The foreach loop then iterates over the chunks, passing only the current subset to each worker process rather than the whole table.

    So the differences in the code are as follows:

    library(iterators)
    dt.all = data.table(
        grp     = factor(rep(1:num.series, each = num.periods)),  # grp column is a factor
        pd      = rep(1:num.periods, num.series),
        y       = rnorm(num.series * num.periods),
        x1      = rnorm(num.series * num.periods),
        x2      = rnorm(num.series * num.periods)
    )

    # %dopar% has to follow foreach(...) on the same line; R cannot parse a
    # line that starts with a binary operator.
    results =
        foreach(dt.sub = isplit(dt.all, dt.all$grp),
                .packages = "data.table", .combine = "rbind") %dopar% {
            f_lm(dt.sub$value, dt.sub$key[[1]])
        }
    

    After isplit, dt.sub is a list with two elements: key is itself a list of the values used to split, and value contains the matching subset as a data.table.
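
    To make that structure concrete, here is a minimal sketch on a made-up six-row data.frame (the names df, it and chunk are illustrative and not part of the original code). It uses only the iterators package and pulls one chunk from the iterator by hand with nextElem, which is roughly what foreach does for each task:

    ```r
    library(iterators)

    # Toy data: two groups of three rows each (values are illustrative only)
    df <- data.frame(grp = factor(rep(c("a", "b"), each = 3)), y = 1:6)

    it <- isplit(df, df$grp)   # iterator over the groups of df
    chunk <- nextElem(it)      # first chunk: a list with 'value' and 'key'

    names(chunk)               # element names of the chunk
    chunk$key[[1]]             # the level that defines this chunk
    chunk$value                # the rows of df belonging to that level
    ```

    Because a data.table is also a data.frame, the same call pattern applies to dt.all in the code above.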

    Credit for this solution is given to a SO answer given by David and a response by Russell to my question on an excellent blog post about iterators.

    ------------------------------------ EDIT ------------------------------------

    To test the performance of isplitDT vs. isplit and of rbindlist vs. rbind, the following code was used:

    rm(list=ls())
    library(data.table)
    library(iterators)
    library(doParallel)
    num.series = 400
    num.periods = 2000
    dt.all = data.table(
        grp     = factor(rep(1:num.series,each=num.periods)), 
        pd      = rep(1:num.periods, num.series), 
        y       = rnorm(num.series * num.periods),
        x1      = rnorm(num.series * num.periods),
        x2      = rnorm(num.series * num.periods)
    )
    dt.all[,y_lag := c(NA, head(y, -1)), by = c("grp")]
    
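    # Fit the regression for one series and return its coefficient table,
    # tagged with the group identifier.
    f_lm = function(dt.sub, grp) {
        my.model = lm(y ~ y_lag + x1 + x2, data = dt.sub)
        coef = summary(my.model)$coefficients
        data.table(grp, variable = rownames(coef), coef)
    }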
    
    registerDoParallel(8)
    
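    # Iterator over subsets of data.table x: one subset per value in vals,
    # selected by comparing column colname against that value.
    isplitDT <- function(x, colname, vals) {
      colname <- as.name(colname)
      ival <- iter(vals)
      nextEl <- function() {
        val <- nextElem(ival)   # signals 'StopIteration' once vals is exhausted
        list(value=eval(bquote(x[.(colname) == .(val)])), key=val)
      }
      obj <- list(nextElem=nextEl)
      class(obj) <- c('abstractiter', 'iter')
      obj
    }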
    
    dtcomb <- function(...) {
      rbindlist(list(...))
    }
    
    # isplit/rbind
    st1 = system.time(results <- foreach(dt.sub=isplit(dt.all,dt.all$grp),   
                        .combine="rbind",
                        .packages="data.table")  %dopar% {
        f_lm(dt.sub$value, dt.sub$key[[1]])
    })
    # isplit/rbindlist
    st2 = system.time(results <- foreach(dt.sub=isplit(dt.all,dt.all$grp),  
                    .combine='dtcomb', .multicombine=TRUE,
                    .packages="data.table") %dopar% {
        f_lm(dt.sub$value, dt.sub$key[[1]])
    })
    # isplitDT/rbind
    st3 = system.time(results <- foreach(dt.sub=isplitDT(dt.all, 'grp', unique(dt.all$grp)),
                .combine="rbind",
                .packages='data.table') %dopar% {
        f_lm(dt.sub$value, dt.sub$key)
    })
    # isplitDT/rbindlist
    st4 = system.time(results <- foreach(dt.sub=isplitDT(dt.all, 'grp', unique(dt.all$grp)),
                    .combine='dtcomb', .multicombine=TRUE,
                    .packages='data.table') %dopar% {
        f_lm(dt.sub$value, dt.sub$key)
      })
    
    rbind(st1, st2, st3, st4)
    

    This gives the following timings (in seconds, as reported by system.time):

        user.self sys.self elapsed user.child sys.child
    st1     12.08     1.53   14.66         NA        NA
    st2     12.05     1.41   14.08         NA        NA
    st3     45.33     2.40   48.14         NA        NA
    st4     45.00     3.30   48.70         NA        NA
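
    As a small illustration of the dtcomb combiner used in the benchmark, the sketch below checks on two made-up one-row tables (a and b are hypothetical, not benchmark output) that rbindlist(list(...)) and rbind stack the same rows. With .multicombine=TRUE, foreach may hand more than two results to the combiner in a single call, which is why dtcomb accepts `...`:

    ```r
    library(data.table)

    # Two toy result tables (values are illustrative only)
    a <- data.table(grp = 1L, est = 0.5)
    b <- data.table(grp = 2L, est = 1.5)

    # Same combiner as in the benchmark: stack any number of tables at once
    dtcomb <- function(...) {
      rbindlist(list(...))
    }

    combined_list  <- dtcomb(a, b)   # via rbindlist
    combined_rbind <- rbind(a, b)    # via rbind

    # Compare the two results by content
    all.equal(combined_list, combined_rbind)
    ```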
    

    ------------------------------------ EDIT 2 ------------------------------------

    Thanks to Steve's updated answer and its isplitDT2 function, which makes use of the keys on the data.table, we have a clear new winner in terms of speed. A microbenchmark comparison against my original solution (in this answer) shows roughly a 7-fold improvement for isplitDT2 with rbindlist. Memory usage has not yet been compared directly, but the performance gain leads me to accept the answer at last.
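
    The code for isplitDT2 is in Steve's answer and is not reproduced here. As a rough sketch of the general idea of a keyed lookup only (this is not Steve's implementation, and the toy table dt is made up), data.table lets you set a key on the grouping column and then select one group through the key rather than through a logical comparison:

    ```r
    library(data.table)

    # Toy table: three groups of two rows each (values are illustrative only)
    dt <- data.table(grp = rep(1:3, each = 2), y = 1:6)

    setkey(dt, grp)              # sort by grp and mark it as the key

    sub_keyed <- dt[.(2L)]       # keyed lookup of group 2
    sub_scan  <- dt[grp == 2L]   # same rows via a logical expression

    sub_keyed$y
    sub_scan$y
    ```

    Both subsets contain the same rows; any speed difference would have to be measured on the real table, for example with microbenchmark.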
