Improving model training speed in caret (R)

Asked by 轮回少年 on 2021-01-31 12:30 · 3 answers

I have a dataset consisting of 20 features and roughly 300,000 observations. I'm using caret to train a model with doParallel and four cores. Even training on 10% of my data takes a long time.

3 Answers
  •  Answered by 被撕碎了的回忆 on 2021-01-31 12:47

    Great inputs by @phiver and @topepo. I will try to summarize, and add a few more points I gathered from searching SO posts about a similar problem:

    • Yes, parallel processing reduces training time, but at the cost of more memory, since each worker typically gets its own copy of the data. With 8 cores and 64GB RAM, a rule of thumb is to use 5-6 workers at most.
    • @topepo's page on caret pre-processing here is fantastic. It is instructive step by step and replaces much of the manual pre-processing work, such as creating dummy variables and removing multi-collinear/linear-combination variables, as well as transformations.
    • One reason randomForest and other models become really slow is the number of levels in categorical variables. It is advisable either to collapse factor levels or, where possible, to convert them to an ordinal/numeric encoding.
    • Use caret's tuneGrid argument to the fullest for ensemble models. Start with small values of mtry/ntree on a sample of the data and see how much accuracy improves.
    • I found this SO page very useful; it primarily suggests parRF. I didn't see much improvement on my dataset from replacing rf with parRF, but you can try it. The other suggestions there are to use data.table instead of data frames, and to pass predictor/response data (x = X, y = Y) instead of a formula. This greatly improves speed, believe me. One caveat: the predictor/response interface also seems to slightly change predictive accuracy, and it changes the variable importance table, which is no longer broken up factor-wise as it is with the formula interface (Y ~ .).
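
    To make the first bullet concrete, here is a minimal sketch of registering a capped doParallel backend before calling train(). The cap of 6 workers follows the rule of thumb above; the exact numbers are assumptions, not from the question.

    ```r
    # Sketch: register a parallel backend with fewer workers than cores,
    # leaving memory headroom (each worker copies the training data).
    library(doParallel)

    cores <- parallel::detectCores()
    # detectCores() can return NA on some platforms; fall back to 2
    n_workers <- if (is.na(cores)) 2 else max(1, min(cores - 2, 6))

    cl <- makeCluster(n_workers)
    registerDoParallel(cl)

    # ... caret::train() calls here run their resampling loops in parallel ...

    stopCluster(cl)   # release the workers when done
    registerDoSEQ()   # return foreach to sequential mode
    ```

    With this backend registered, caret parallelizes over resamples automatically; no change to the train() call itself is needed.
    
    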
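
    The pre-processing bullet can also be sketched with caret's own helpers. This is an illustrative example on the built-in iris data, not the asker's dataset: nearZeroVar() drops near-constant predictors, findLinearCombos() drops exact linear combinations, and dummyVars() expands factors.

    ```r
    # Sketch: automated pre-processing with caret helpers (illustrative data).
    library(caret)

    data(iris)
    x <- iris[, 1:4]  # numeric predictors only

    # Drop near-zero-variance predictors, if any
    nzv <- nearZeroVar(x)
    if (length(nzv) > 0) x <- x[, -nzv]

    # Drop predictors that are exact linear combinations of others
    combos <- findLinearCombos(as.matrix(x))
    if (length(combos$remove) > 0) x <- x[, -combos$remove]

    # Expand any factor predictors into dummy variables
    # (iris predictors are all numeric, so this is a pass-through here)
    dv <- dummyVars(~ ., data = x)
    x_processed <- predict(dv, newdata = x)
    ```

    Running these filters before train() keeps the slow model-fitting step from repeatedly paying for redundant columns.
    
    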
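
    Finally, a sketch combining the last two bullets: a small tuneGrid for random forest and the predictor/response (x/y) interface instead of a formula. The mtry and ntree values are illustrative starting points, not tuned recommendations, and iris stands in for the real data.

    ```r
    # Sketch: small tuning grid + x/y interface for a fast first pass.
    library(caret)
    library(randomForest)

    data(iris)
    x <- iris[, 1:4]
    y <- iris$Species

    fit <- train(
      x = x, y = y,                            # x/y form, no formula
      method = "rf",
      ntree = 100,                             # start small; passed to randomForest
      tuneGrid = expand.grid(mtry = c(2, 3)),  # tiny grid to probe accuracy
      trControl = trainControl(method = "cv", number = 3)
    )
    fit$bestTune
    ```

    Once the cheap grid shows where accuracy plateaus, the grid and ntree can be widened on the full data.
    
    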
