Memory issue with foreach loop in R on Windows 8 (64-bit) (doParallel package)

I'm trying to move from a serial to a parallel approach to accomplish some multivariate time series analysis tasks on a large data.table. The table contains data fo…

Answer

    Iterators can help to reduce the amount of memory that needs to be passed to the workers of a parallel program. Since you're using the data.table package, it's a good idea to use iterators and combine functions that are optimized for data.table objects. For example, here is a function like isplit that works on data.table objects:

    library(iterators)  # provides iter() and nextElem()

    # Iterator that yields one subset of x per value in vals, selecting
    # the rows where the column named by 'colname' equals the current value
    isplitDT <- function(x, colname, vals) {
      colname <- as.name(colname)
      ival <- iter(vals)
      nextEl <- function() {
        val <- nextElem(ival)
        # Build and evaluate the call x[colname == val] via bquote()
        list(value=eval(bquote(x[.(colname) == .(val)])), key=val)
      }
      obj <- list(nextElem=nextEl)
      class(obj) <- c('abstractiter', 'iter')
      obj
    }
    

    Note that it isn't completely compatible with isplit, since the arguments and return value are slightly different. There may also be a better way to subset the data.table, but I think this is more efficient than using isplit.
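
    As a quick sanity check, the iterator can be driven by hand with nextElem. Here is a minimal, hypothetical example (the toy table dt and the column 'grp' are invented for illustration):

    library(data.table)
    library(iterators)

    dt <- data.table(grp = rep(c('a', 'b'), each = 3), value = 1:6)
    it <- isplitDT(dt, 'grp', unique(dt$grp))

    el <- nextElem(it)
    el$key             # "a"
    el$value           # the rows of dt where grp == "a"
    nextElem(it)$key   # "b"; one more call signals 'StopIteration'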

    Here is your example using isplitDT and a combine function that uses rbindlist, which combines data.tables faster than rbind:

    # Combine function for foreach: rbindlist() stacks the per-group
    # results into a single data.table efficiently
    dtcomb <- function(...) {
      rbindlist(list(...))
    }
    
    results <- 
      foreach(dt.sub=isplitDT(dt.all, 'grp', unique(dt.all$grp)),
              .combine='dtcomb', .multicombine=TRUE,
              .packages='data.table') %dopar% {
        f_lm(dt.sub$value, dt.sub$key)
      }
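
    For completeness, this is roughly how the surrounding setup might look with doParallel. The worker count is arbitrary, and f_lm is a hypothetical stand-in, since the question's actual definition isn't shown in this excerpt:

    library(doParallel)
    library(data.table)

    cl <- makeCluster(4)          # the number of workers is an assumption
    registerDoParallel(cl)

    # Hypothetical placeholder for the question's f_lm: fit a linear
    # model within one group and return the coefficients as a row
    f_lm <- function(d, key) {
      fit <- lm(value ~ seq_len(nrow(d)), data = d)
      data.table(grp = key, intercept = coef(fit)[1], slope = coef(fit)[2])
    }

    # ... run the foreach() %dopar% loop shown above ...

    stopCluster(cl)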
    

    Update

    I wrote a new iterator function called isplitDT2 which performs much better than isplitDT but requires that the data.table have a key:

    isplitDT2 <- function(x, vals) {
      ival <- iter(vals)
      nextEl <- function() {
        val <- nextElem(ival)
        # x[val] joins against the key, a binary search rather than
        # the vector scan performed by isplitDT
        list(value=x[val], key=val)
      }
      obj <- list(nextElem=nextEl)
      class(obj) <- c('abstractiter', 'iter')
      obj
    }
    

    This is called as:

    # setkey() sorts dt.all by grp so that dt.all[val] can use a binary
    # search; levels() assumes grp is a factor (use unique() for characters)
    setkey(dt.all, grp)
    results <-
      foreach(dt.sub=isplitDT2(dt.all, levels(dt.all$grp)),
              .combine='dtcomb', .multicombine=TRUE,
              .packages='data.table') %dopar% {
        f_lm(dt.sub$value, dt.sub$key)
      }
    

    This uses a binary search to subset dt.all rather than a vector scan, and so is more efficient. I don't know why isplitDT would use more memory, however. Since you're using doParallel, which doesn't call the iterator on-the-fly as it sends out tasks, you might want to experiment with splitting dt.all and then removing it to reduce your memory usage:

    # Materialize every subset up front, then release the full table so
    # the master doesn't hold both dt.all and its subsets at once
    dt.split <- as.list(isplitDT2(dt.all, levels(dt.all$grp)))
    rm(dt.all)
    gc()
    results <- 
      foreach(dt.sub=dt.split,
              .combine='dtcomb', .multicombine=TRUE,
              .packages='data.table') %dopar% {
        f_lm(dt.sub$value, dt.sub$key)
      }
    

    This may help by reducing the amount of memory needed by the master process during the execution of the foreach loop, while still only sending the required data to the workers. If you still have memory problems, you could also try using doMPI or doRedis, both of which get iterator values as needed, rather than all at once, making them more memory efficient.
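
    Switching backends only changes the registration step; the foreach loop itself is unchanged. Here is a minimal sketch with doRedis (it assumes a Redis server running locally, and the queue name 'jobs' is arbitrary):

    library(doRedis)
    registerDoRedis('jobs')                   # tasks are pulled from the queue on demand
    startLocalWorkers(n = 4, queue = 'jobs')

    # ... the same foreach() %dopar% loop as above ...

    removeQueue('jobs')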
