Iterators can help to reduce the amount of memory that needs to be passed to the workers of a parallel program. Since you're using the data.table package, it's a good idea to use iterators and combine functions that are optimized for data.table objects. For example, here is a function like isplit
that works on data.table objects:
# relies on the iterators package for iter() and nextElem()
isplitDT <- function(x, colname, vals) {
  colname <- as.name(colname)
  ival <- iter(vals)
  nextEl <- function() {
    val <- nextElem(ival)
    list(value=eval(bquote(x[.(colname) == .(val)])), key=val)
  }
  obj <- list(nextElem=nextEl)
  class(obj) <- c('abstractiter', 'iter')
  obj
}
Note that it isn't completely compatible with isplit, since the arguments and return value are slightly different. There may also be a better way to subset the data.table, but I think this is more efficient than using isplit.
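For comparison, here is a minimal sketch of how the two are called (dt.all and grp are the names from your example; note that isplit's key is a list of factor levels, while isplitDT's key is the value itself):

library(iterators)

# isplit: split a vector or data frame by a factor; each element has
# $value (the subset) and $key (a list of factor levels)
it1 <- isplit(as.data.frame(dt.all), dt.all$grp)

# isplitDT: pass the data.table, a column name, and the values to iterate
# over; each element has $value (a data.table subset) and $key (the value)
it2 <- isplitDT(dt.all, 'grp', unique(dt.all$grp))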
Here is your example using isplitDT and a combine function that uses rbindlist, which combines data.tables faster than rbind:
dtcomb <- function(...) {
  rbindlist(list(...))
}

results <-
  foreach(dt.sub=isplitDT(dt.all, 'grp', unique(dt.all$grp)),
          .combine='dtcomb', .multicombine=TRUE,
          .packages='data.table') %dopar% {
    f_lm(dt.sub$value, dt.sub$key)
  }
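f_lm here is the per-group modelling function from your question. If you want to run this example standalone, a hypothetical stand-in along these lines is enough (the column names y and x are assumptions, not taken from your data):

# hypothetical stand-in for f_lm: fit a linear model on one group's subset
# and return its coefficients as a one-row data.table
f_lm <- function(d, key) {
  fit <- lm(y ~ x, data = d)
  data.table(grp = key, intercept = coef(fit)[1], slope = coef(fit)[2])
}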
Update
I wrote a new iterator function called isplitDT2, which performs much better than isplitDT but requires that the data.table have a key:
isplitDT2 <- function(x, vals) {
  ival <- iter(vals)
  nextEl <- function() {
    val <- nextElem(ival)
    list(value=x[val], key=val)
  }
  obj <- list(nextElem=nextEl)
  class(obj) <- c('abstractiter', 'iter')
  obj
}
This is called as:
setkey(dt.all, grp)

results <-
  foreach(dt.sub=isplitDT2(dt.all, levels(dt.all$grp)),
          .combine='dtcomb', .multicombine=TRUE,
          .packages='data.table') %dopar% {
    f_lm(dt.sub$value, dt.sub$key)
  }
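To see where the speedup comes from, here is a minimal sketch contrasting the two subsetting styles (val stands for a single value of grp, e.g. one of its levels):

# vector scan: compares every row's grp against val
dt.sub <- dt.all[grp == val]

# binary search: after setkey(dt.all, grp), a character val is joined
# against the key index rather than compared row by row
dt.sub <- dt.all[val]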
isplitDT2 uses a binary search to subset dt.all rather than a vector scan, and so is more efficient. I don't know why isplitDT would use more memory, however.

Since you're using doParallel, which doesn't call the iterator on-the-fly as it sends out tasks, you might want to experiment with splitting dt.all and then removing it to reduce your memory usage:
dt.split <- as.list(isplitDT2(dt.all, levels(dt.all$grp)))
rm(dt.all)
gc()

results <-
  foreach(dt.sub=dt.split,
          .combine='dtcomb', .multicombine=TRUE,
          .packages='data.table') %dopar% {
    f_lm(dt.sub$value, dt.sub$key)
  }
This may help by reducing the amount of memory needed by the master process during the execution of the foreach loop, while still only sending the required data to the workers. If you still have memory problems, you could also try using doMPI or doRedis, both of which get iterator values as needed, rather than all at once, making them more memory efficient.
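For example, switching to doMPI only changes how the backend is registered; the foreach loops themselves stay the same (the worker count below is just an illustration):

library(doMPI)
cl <- startMPIcluster(count=4)   # start 4 workers; usually launched via mpirun
registerDoMPI(cl)

# ... run the foreach loop as above ...

closeCluster(cl)
mpi.quit()                        # only in a batch script run via mpirun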