"When I train just using glm, everything works, and I don't even come close to exhausting memory. But when I run train(..., method='glm'), I run out of memory."
Gavin's answer is spot on. I built the function for ease of use rather than for speed or efficiency [1].
First, using the formula interface can be an issue when you have a lot of predictors. This is something that R Core could fix; the formula approach requires a very large but sparse terms() matrix to be retained, and R has packages to effectively deal with that issue. For example, with n = 3,000 and p = 2,000, a 3-tree random forest model object was 1.5 times larger in size and took 23 times longer to execute when using the formula interface (282s vs 12s).
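To make the two call styles concrete, here is a minimal sketch (hypothetical data; rf and ntree = 3 mirror the benchmark above, but the object names are mine):

    library(caret)

    ## hypothetical wide data: n = 3,000 rows, p = 2,000 predictors
    n <- 3000; p <- 2000
    x <- as.data.frame(matrix(rnorm(n * p), nrow = n))
    y <- factor(sample(c("yes", "no"), n, replace = TRUE))
    dat <- cbind(x, Class = y)

    ## formula interface: a large terms() structure is built and stored
    fit_f <- train(Class ~ ., data = dat, method = "rf", ntree = 3)

    ## non-formula interface: the predictor data are passed directly
    fit_x <- train(x, y, method = "rf", ntree = 3)

    ## compare the footprints
    object.size(fit_f)
    object.size(fit_x)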
Second, you don't have to keep the training data (see the returnData argument in trainControl()).
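For example (a sketch; the resampling settings are placeholders):

    ## don't keep a copy of the training set inside the fitted object
    ctrl <- trainControl(method = "cv", number = 10,
                         returnData = FALSE)
    fit  <- train(x, y, method = "glm", trControl = ctrl)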
Also, since R doesn't have any real shared memory infrastructure, Gavin is correct about the number of copies of the data that are retained in memory. Basically, a list is created for every resample and lapply() is used to process the list, then return only the resampled estimates. An alternative would be to sequentially make one copy of the data (for the current resample), do the required operations, then repeat for the remaining iterations. The issue there is I/O and the inability to do any parallel processing. [2]
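In rough pseudo-R, the current scheme looks like this (the shape of the idea, not caret's actual internals; index_list and fit_and_measure() are made-up names):

    ## one list element per resample, each carrying its own copy of the data
    resampled <- lapply(index_list, function(idx)
        list(x = x[idx, , drop = FALSE], y = y[idx]))

    ## process the list, keeping only the performance estimates
    perf <- lapply(resampled, function(rs) fit_and_measure(rs$x, rs$y))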
If you have a large data set, I suggest using the non-formula interface (even though the actual model, like glm, eventually uses a formula). Also, for large data sets, train() saves the resampling indices for use by resamples() and other functions. You could probably remove those too.
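If the fitted object is still too big, something like this may help; I'm assuming the indices live under the control element of the train object (check str(fit) to confirm on your version):

    fit <- train(x, y, method = "glm", trControl = ctrl)

    ## drop the stored resampling indices (only needed for resamples() etc.)
    fit$control$index    <- NULL
    fit$control$indexOut <- NULL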
Yang - it would be good to know more about the data via str(data) so we can understand the dimensions and other aspects (e.g. factors with many levels, etc).
I hope that helps,
Max
[1] I should note that we go to great lengths to fit as few models as possible when we can. The "sub-model" trick is used for many models, such as pls, gbm, rpart, earth and many others. Also, when a model has formula and non-formula interfaces (e.g. lda() or earth()), we default to the non-formula interface.
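As an illustration of the sub-model trick with gbm: one boosted model fit at the largest n.trees can produce predictions for every smaller value, so a grid like the one below costs roughly one model fit per interaction.depth value rather than one per row (grid values are arbitrary; newer caret versions also require n.minobsinnode in the grid):

    grid <- expand.grid(n.trees = c(50, 100, 150),
                        interaction.depth = 1:2,
                        shrinkage = 0.1,
                        n.minobsinnode = 10)
    fit  <- train(x, y, method = "gbm", tuneGrid = grid, verbose = FALSE)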
[2] Every once in a while I get the insane urge to reboot the train() function. Using foreach might get around some of these issues.
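For what it's worth, caret did later adopt foreach: registering a parallel backend is enough to run the resampling loop in parallel (a sketch using doParallel; the worker count is arbitrary):

    library(doParallel)

    cl <- makeCluster(4)   ## four workers; size this to your machine
    registerDoParallel(cl)

    fit <- train(x, y, method = "glm",
                 trControl = trainControl(method = "cv", number = 10))

    stopCluster(cl)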