Incremental training of random forest model using Python sklearn


Question


I am using the code below to save a random forest model, with cPickle to persist the trained model. As new data arrives, can I train the model incrementally? The training set currently covers about two years of data. Is there a way to train on another two years and (kind of) append it to the existing saved model?

import os
import cPickle  # Python 2; on Python 3, use the built-in pickle module instead
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100)
print("Trying to fit the Random Forest model -->")
if os.path.exists('rf.pkl'):
    # Reuse the previously trained, pickled model
    print("Trained model already pickled -->")
    with open('rf.pkl', 'rb') as f:
        rf = cPickle.load(f)
else:
    # Train from scratch and pickle the fitted model for next time
    df_x_train = x_train[col_feature]
    rf.fit(df_x_train, y_train)
    print("Training for the model done")
    with open('rf.pkl', 'wb') as f:
        cPickle.dump(rf, f)
df_x_test = x_test[col_feature]
pred = rf.predict(df_x_test)

EDIT 1: I don't have the compute capacity to train the model on 4 years of data all at once.


Answer 1:


What you're talking about, incrementally updating a model with additional data, is discussed in the sklearn User Guide:

Although not all algorithms can learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates. Actually, the ability to learn incrementally from a mini-batch of instances (sometimes called “online learning”) is key to out-of-core learning as it guarantees that at any given time there will be only a small amount of instances in the main memory.

The guide includes a list of classifiers and regressors that implement partial_fit(), but RandomForest is not among them. You can also confirm that partial_fit is absent from the documentation page for RandomForestRegressor.

Some possible ways forward:

  • Use a regressor that does implement partial_fit(), such as SGDRegressor (see the first sketch after this list)
  • Check your RandomForest model's feature_importances_ attribute, then retrain on three or four years of data after dropping unimportant features (see the second sketch after this list)
  • Train your model on only the most recent two years of data, if you can only use two years
  • Train your model on a random subset drawn from all four years of data.
  • Limit how complex the trees can get with the max_depth parameter. This saves computation time and so may allow you to use all your data; it can also prevent overfitting. Use cross-validation to select the best max_depth for your problem
  • Set your RF model's n_jobs=-1 if you haven't already, to use multiple cores/processors on your machine
  • Use a faster ensemble-tree-based algorithm, such as xgboost
  • Run your model-fitting code on a large machine in the cloud, such as AWS or dominodatalab
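
Two of these options lend themselves to short sketches. First, a minimal sketch of out-of-core training with SGDRegressor's partial_fit, assuming the four years of data arrive as an iterable of (X, y) chunks (yearly_chunks is a hypothetical name). SGD is sensitive to feature scaling, so a StandardScaler is updated chunk by chunk alongside it:

from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
sgd = SGDRegressor()

for x_chunk, y_chunk in yearly_chunks:  # hypothetical iterable of (X, y) mini-batches
    # Update the running mean/variance estimates, then scale this chunk
    x_scaled = scaler.partial_fit(x_chunk).transform(x_chunk)
    # One incremental pass over the mini-batch; earlier chunks remain learned
    sgd.partial_fit(x_scaled, y_chunk)

pred = sgd.predict(scaler.transform(x_test[col_feature]))

Second, a sketch of the feature-pruning option, assuming rf, col_feature, x_train and y_train come from the question's code; the 0.01 importance threshold is an arbitrary choice:

from sklearn.ensemble import RandomForestRegressor

importances = rf.feature_importances_  # available once rf has been fitted
keep = [c for c, imp in zip(col_feature, importances) if imp >= 0.01]  # threshold is arbitrary

rf_small = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf_small.fit(x_train[keep], y_train)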



Answer 2:


You can set the warm_start parameter to True when constructing the model. With warm_start=True, each call to fit keeps the trees from the previous call and adds new ones instead of training from scratch. Note that n_estimators must be increased between calls; otherwise no new trees are fitted.

Here the same model learns incrementally from two batches (train_X[:1], then train_X[1:2]) with warm_start set:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(n_estimators=100, warm_start=True)

# First batch
forest_model.fit(train_X[:1], train_y[:1])
pred_y = forest_model.predict(val_X[:1])
print("mae    :", mean_absolute_error(val_y[:1], pred_y))
print("pred_y :", pred_y)

# Second batch: grow n_estimators first, or fit() adds no new trees
forest_model.set_params(n_estimators=forest_model.n_estimators + 100)
forest_model.fit(train_X[1:2], train_y[1:2])
pred_y = forest_model.predict(val_X[1:2])
print("mae    :", mean_absolute_error(val_y[1:2], pred_y))
print("pred_y :", pred_y)

mae    : 1290000.0
pred_y : [ 1630000.]
mae    : 925000.0
pred_y : [ 1630000.]

For comparison, a fresh model trained only on the second batch (train_X[1:2]):

forest_model = RandomForestRegressor()
forest_model.fit(train_X[1:2], train_y[1:2])
pred_y = forest_model.predict(val_X[1:2])
print("mae    :", mean_absolute_error(val_y[1:2], pred_y))
print("pred_y :", pred_y)

mae    : 515000.0
pred_y : [ 1220000.]
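
Applied to the question's setup, a minimal sketch: load the pickled forest trained on the first two years, then grow it on the new data (x_train_new and y_train_new are hypothetical names for the new two years). One caveat: warm_start only appends trees fitted on the new batch; the trees already in the ensemble are never updated with the new data.

import cPickle  # or pickle on Python 3

with open('rf.pkl', 'rb') as f:
    rf = cPickle.load(f)  # forest trained on the first two years

# n_estimators must grow, or the next fit() adds no new trees
rf.set_params(warm_start=True, n_estimators=rf.n_estimators + 100)
rf.fit(x_train_new[col_feature], y_train_new)  # adds 100 trees fit on the new data only

with open('rf.pkl', 'wb') as f:
    cPickle.dump(rf, f)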

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html



Source: https://stackoverflow.com/questions/44060432/incremental-training-of-random-forest-model-using-python-sklearn
