Memory issue with foreach loop in R on Windows 8 (64-bit) (doParallel package)

夕颜 2021-02-05 22:32

I'm trying to move from a serial to a parallel approach to accomplish some multivariate time series analysis tasks on a large data.table. The table contains data fo…

3 Answers
  • 2021-02-05 23:09

    Holding everything in memory is one of those (aargh, annoying) things that R programmers have to learn to deal with. It's pretty easy to imagine your code example as either memory-bound or CPU-bound, and you'll need to figure that out before trying to apply workarounds.
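
    As a rough first check (just a sketch, assuming the table and grouping column are named dt.all and grp as in the other answers), you can compare the size of the full table with the size of a single group's subset to get a feel for whether the data itself is what's exhausting memory:

    library(data.table)
    # size of the full table each worker would receive vs. one group's subset
    format(object.size(dt.all), units = "MB")
    format(object.size(dt.all[grp == unique(grp)[1]]), units = "MB")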

    Assuming the memory is being consumed by your dataset (dt.all) and not during the actual model run, you might be able to release enough memory for the worker processes to parallelize:

    foreach(g=unique(dt.all$grp), .packages="data.table", .combine="rbind")  %dopar%
    {
        # use a loop variable named differently from the grp column; otherwise
        # grp == grp inside the data.table just compares the column to itself
        dt.sub = dt.all[grp == g]
        rm(dt.all)
        gc()
        f_lm(dt.sub, g)
    }
    

    However, this assumes that your working set (dt.sub) is small enough that you can fit more than one of them in memory at a time. It isn't hard to imagine a problem set too large for that. Also, and this is really annoying, all the workers are going to fire up at the same time and kill your machine anyway, so you might need to make them pause for a couple of seconds (sketched below) to allow other children to load up and release memory.
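
    A rough sketch of that staggering idea (illustrative only; the delay value is not tuned, and doParallel does not guarantee tasks are handed out strictly in index order), where each task sleeps briefly before touching the data:

    grps <- unique(dt.all$grp)
    nw   <- getDoParWorkers()   # number of workers, taken on the master
    results <- foreach(i = seq_along(grps), .packages = "data.table",
                       .combine = "rbind") %dopar% {
        Sys.sleep((i %% nw) * 2)   # stagger start-up by a couple of seconds
        g <- grps[i]
        dt.sub <- dt.all[grp == g]   # dt.all is still copied to each worker, as in the code above
        f_lm(dt.sub, g)
    }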

    Though desperately stupid and brute-force, I have handled this exact problem by writing the subsets out to disk as individual data files and then using a batch script to run my computations in parallel.
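
    For what it's worth, a minimal sketch of that disk-based approach might look like the following (the directory layout and the run_one.R helper are made up for illustration):

    # write one .rds file per group
    dir.create("subsets", showWarnings = FALSE)
    for (g in unique(dt.all$grp)) {
        saveRDS(dt.all[grp == g], file.path("subsets", paste0("grp_", g, ".rds")))
    }

    # run_one.R -- launched from a batch script, e.g.: Rscript run_one.R subsets/grp_1.rds
    # args   <- commandArgs(trailingOnly = TRUE)
    # dt.sub <- readRDS(args[1])
    # res    <- f_lm(dt.sub, unique(dt.sub$grp))
    # saveRDS(res, sub("\\.rds$", "_result.rds", args[1]))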

  • 2021-02-05 23:23

    Iterators can help to reduce the amount of memory that needs to be passed to the workers of a parallel program. Since you're using the data.table package, it's a good idea to use iterators and combine functions that are optimized for data.table objects. For example, here is a function like isplit that works on data.table objects:

    isplitDT <- function(x, colname, vals) {
      colname <- as.name(colname)
      ival <- iter(vals)
      nextEl <- function() {
        val <- nextElem(ival)
        list(value=eval(bquote(x[.(colname) == .(val)])), key=val)
      }
      obj <- list(nextElem=nextEl)
      class(obj) <- c('abstractiter', 'iter')
      obj
    }
    

    Note that it isn't completely compatible with isplit, since the arguments and return value are slightly different. There may also be a better way to subset the data.table, but I think this is more efficient than using isplit.

    Here is your example using isplitDT and a combine function that uses rbindlist, which combines data.tables faster than rbind:

    dtcomb <- function(...) {
      rbindlist(list(...))
    }
    
    results <- 
      foreach(dt.sub=isplitDT(dt.all, 'grp', unique(dt.all$grp)),
              .combine='dtcomb', .multicombine=TRUE,
              .packages='data.table') %dopar% {
        f_lm(dt.sub$value, dt.sub$key)
      }
    

    Update

    I wrote a new iterator function called isplitDT2 which performs much better than isplitDT but requires that the data.table have a key:

    isplitDT2 <- function(x, vals) {
      ival <- iter(vals)
      nextEl <- function() {
        val <- nextElem(ival)
        list(value=x[val], key=val)
      }
      obj <- list(nextElem=nextEl)
      class(obj) <- c('abstractiter', 'iter')
      obj
    }
    

    This is called as:

    setkey(dt.all, grp)
    results <-
      foreach(dt.sub=isplitDT2(dt.all, levels(dt.all$grp)),
              .combine='dtcomb', .multicombine=TRUE,
              .packages='data.table') %dopar% {
        f_lm(dt.sub$value, dt.sub$key)
      }
    

    This uses a binary search to subset dt.all rather than a vector scan, and so is more efficient. I don't know why isplitDT would use more memory, however. Since you're using doParallel, which doesn't call the iterator on-the-fly as it sends out tasks, you might want to experiment with splitting dt.all and then removing it to reduce your memory usage:

    dt.split <- as.list(isplitDT2(dt.all, levels(dt.all$grp)))
    rm(dt.all)
    gc()
    results <- 
      foreach(dt.sub=dt.split,
              .combine='dtcomb', .multicombine=TRUE,
              .packages='data.table') %dopar% {
        f_lm(dt.sub$value, dt.sub$key)
      }
    

    This may help by reducing the amount of memory needed by the master process during the execution of the foreach loop, while still only sending the required data to the workers. If you still have memory problems, you could also try using doMPI or doRedis, both of which get iterator values as needed, rather than all at once, making them more memory efficient.
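
    For example, here is a sketch of switching the backend to doMPI (assuming an MPI installation and the Rmpi package are available on your machine; the foreach loop itself is unchanged, only the registration differs):

    library(doMPI)
    cl <- startMPIcluster(count = 4)   # workers pull tasks, and thus iterator values, as needed
    registerDoMPI(cl)

    setkey(dt.all, grp)
    results <-
      foreach(dt.sub=isplitDT2(dt.all, levels(dt.all$grp)),
              .combine='dtcomb', .multicombine=TRUE,
              .packages='data.table') %dopar% {
        f_lm(dt.sub$value, dt.sub$key)
      }

    closeCluster(cl)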

  • 2021-02-05 23:28

    The answer requires the iterators package and the use of isplit, which is similar to split in that it breaks the main data object into chunks based on one or more factor columns. The foreach loop then iterates through the chunks of data, passing only the subset out to the worker process rather than the whole table.

    So the differences in the code are as follows:

    library(iterators)
    dt.all = data.table(
        grp     = factor(rep(1:num.series, each = num.periods)),  # grp column is a factor
        pd      = rep(1:num.periods, num.series), 
        y       = rnorm(num.series * num.periods),
        x1      = rnorm(num.series * num.periods),
        x2      = rnorm(num.series * num.periods)
    ) 
    
    results = 
        foreach(dt.sub = isplit(dt.all, dt.all$grp), .packages="data.table",
                .combine="rbind") %dopar%
        {
            f_lm(dt.sub$value, dt.sub$key[[1]])
        }
    

    The result of the isplit is that dt.sub is now a list with two elements: key is itself a list of the value(s) used to split, and value contains the subset as a data.table.
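
    For example (a quick illustrative check, assuming dt.all from the code above), you can pull one chunk from the iterator and inspect it:

    library(iterators)
    it    <- isplit(dt.all, dt.all$grp)
    chunk <- nextElem(it)
    chunk$key[[1]]   # the grp value this chunk was split on
    chunk$value      # the corresponding subset of rows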

    Credit for this solution is given to a SO answer given by David and a response by Russell to my question on an excellent blog post about iterators.

    ------------------------------------ EDIT ------------------------------------

    To test the performance of isplitDT vs. isplit and rbindlist vs. rbind, the following code was used:

    rm(list=ls())
    library(data.table) ; library(iterators)  ;   library(doParallel)
    num.series = 400
    num.periods = 2000
    dt.all = data.table(
        grp     = factor(rep(1:num.series,each=num.periods)), 
        pd      = rep(1:num.periods, num.series), 
        y       = rnorm(num.series * num.periods),
        x1      = rnorm(num.series * num.periods),
        x2      = rnorm(num.series * num.periods)
    )
    dt.all[,y_lag := c(NA, head(y, -1)), by = c("grp")]
    
    f_lm = function(dt.sub, grp) {
        my.model = lm("y ~ y_lag + x1 + x2 ", data = dt.sub)
        coef = summary(my.model)$coefficients
        data.table(grp, variable = rownames(coef), coef)
    }
    
    registerDoParallel(8)
    
    isplitDT <- function(x, colname, vals) {
      colname <- as.name(colname)
      ival <- iter(vals)
      nextEl <- function() {
        val <- nextElem(ival)
        list(value=eval(bquote(x[.(colname) == .(val)])), key=val)
      }
      obj <- list(nextElem=nextEl)
      class(obj) <- c('abstractiter', 'iter')
      obj
    }
    
    dtcomb <- function(...) {
      rbindlist(list(...))
    }
    
    # isplit/rbind
    st1 = system.time(results <- foreach(dt.sub=isplit(dt.all,dt.all$grp),   
                        .combine="rbind",
                        .packages="data.table")  %dopar% {
        f_lm(dt.sub$value, dt.sub$key[[1]])
    })
    # isplit/rbindlist
    st2 = system.time(results <- foreach(dt.sub=isplit(dt.all,dt.all$grp),  
                    .combine='dtcomb', .multicombine=TRUE,
                    .packages="data.table") %dopar% {
        f_lm(dt.sub$value, dt.sub$key[[1]])
    })
    # isplitDT/rbind
    st3 = system.time(results <- foreach(dt.sub=isplitDT(dt.all, 'grp', unique(dt.all$grp)),
                    .combine="rbind",
                    .packages='data.table') %dopar% {
        f_lm(dt.sub$value, dt.sub$key)
    })
    # isplitDT/rbindlist
    st4 = system.time(results <- foreach(dt.sub=isplitDT(dt.all, 'grp', unique(dt.all$grp)),
                    .combine='dtcomb', .multicombine=TRUE,
                    .packages='data.table') %dopar% {
        f_lm(dt.sub$value, dt.sub$key)
      })
    
    rbind(st1, st2, st3, st4)
    

    This gives the following timings:

        user.self sys.self elapsed user.child sys.child
    st1     12.08     1.53   14.66         NA        NA
    st2     12.05     1.41   14.08         NA        NA
    st3     45.33     2.40   48.14         NA        NA
    st4     45.00     3.30   48.70         NA        NA
    

    ------------------------------------ EDIT 2 ------------------------------------

    Thanks to Steve's updated answer and his isplitDT2 function, which makes use of the key on the data.table, we have a clear new winner in terms of speed. Running microbenchmark against my original solution (in this answer) shows around a 7-fold improvement from isplitDT2 with rbindlist. Memory usage has not yet been compared directly, but the performance gain leads me to accept that answer.
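
    A sketch of that comparison (not the exact code behind the quoted numbers; it assumes the objects from the benchmark script above plus isplitDT2 from Steve's answer and a registered doParallel backend):

    library(microbenchmark)
    setkey(dt.all, grp)   # isplitDT2 requires a keyed data.table
    microbenchmark(
        isplit_rbind = foreach(dt.sub = isplit(dt.all, dt.all$grp),
                               .combine = "rbind", .packages = "data.table") %dopar%
            f_lm(dt.sub$value, dt.sub$key[[1]]),
        isplitDT2_rbindlist = foreach(dt.sub = isplitDT2(dt.all, levels(dt.all$grp)),
                                      .combine = "dtcomb", .multicombine = TRUE,
                                      .packages = "data.table") %dopar%
            f_lm(dt.sub$value, dt.sub$key),
        times = 5
    )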
