I have a dataset consisting of 20 features and roughly 300,000 observations. I'm using caret to train a model with doParallel and four cores. Even training on 10% of my data takes a long time.
@phiver hits the nail on the head but, for this situation, there are a few things to suggest:
Max
What people forget when comparing the underlying model to using caret is that caret does a lot of extra work on top of a single model fit.
Take your random forest as an example: bootstrap resampling with number = 3 and tuneLength = 5. You resample 3 times, and because of the tuneLength caret tries 5 candidate values of mtry. In total you fit 15 random forests and compare them to find the best settings for the final model, versus only 1 fit if you use the basic random forest model directly.
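A minimal sketch of that setup, assuming a data frame called `training_data` with an outcome column `y` (both placeholders):

```r
library(caret)

## 3 bootstrap resamples, as described above
ctrl <- trainControl(method = "boot", number = 3)

## tuneLength = 5 evaluates 5 candidate mtry values, so caret fits
## 3 x 5 = 15 random forests before refitting the chosen one on the
## full training set.
fit <- train(y ~ ., data = training_data,   # placeholder data
             method = "rf",
             tuneLength = 5,
             trControl = ctrl)
```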
Also, you are running in parallel on 4 cores, and random forest needs all of the observations available on each worker, so your training data ends up in memory four times over. That probably leaves little memory for actually training the model.
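For reference, this is roughly how the 4 workers get registered (illustrative sketch):

```r
library(doParallel)

## Each PSOCK worker is a separate R process that receives its own
## copy of the data used by train(), so the training set can sit in
## memory several times over.
cl <- makePSOCKcluster(4)
registerDoParallel(cl)

## ... call caret::train() here; it picks up the registered backend ...

stopCluster(cl)
```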
My advice is to start scaling down to see if you can speed things up: set the bootstrap number to 1 and tuneLength back to the default of 3, or even set the trainControl method to "none" just to get an idea of how fast the model is with minimal settings and no resampling; see the sketch below.
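A sketch of that baseline, again assuming the placeholder `training_data` / `y`; note that with method = "none" you must pin the tuning parameters yourself via a one-row tuneGrid:

```r
library(caret)

## method = "none" skips resampling entirely: the model is fit exactly
## once with the tuning values you supply.
ctrl_none <- trainControl(method = "none")

fit_once <- train(y ~ ., data = training_data,       # placeholder data
                  method = "rf",
                  tuneGrid = data.frame(mtry = 4),   # e.g. ~sqrt of 20 features
                  trControl = ctrl_none)
```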
Great inputs from @phiver and @topepo. I will try to summarize them and add a few more points that I gathered from searching SO posts about a similar problem: