Subsample size in scikit-learn RandomForestClassifier

走远了吗. Submitted on 2021-02-09 08:21:11

Question


How can I control the size of the subsample used to train each tree in the forest? According to the scikit-learn documentation:

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).

So bootstrap introduces randomness, but I can't find how to control the size of the subsample.


Answer 1:


Scikit-learn doesn't expose this option, but you can easily get it with a (slower) combination of a decision tree and the bagging meta-classifier:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=0.5)
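A runnable sketch of this approach on toy data (the dataset and parameter values here are illustrative, not from the original answer; note that the keyword was renamed from base_estimator to estimator in scikit-learn 1.2, so the tree is passed positionally to work on both):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data purely for illustration.
X, y = make_classification(n_samples=200, random_state=0)

# Each tree is trained on a random half of the rows (max_samples=0.5).
# max_features="sqrt" on the tree mimics random forest's per-split
# feature subsampling.
clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=100,
    max_samples=0.5,
    random_state=0,
)
clf.fit(X, y)
print(clf.score(X, y))
```

By default BaggingClassifier samples without replacement (bootstrap=False); pass bootstrap=True if you also want the with-replacement behaviour of a random forest.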

As a side note, Breiman's random forest indeed doesn't treat the subsample size as a parameter; it relies entirely on the bootstrap, so each tree is built from approximately a (1 - 1/e) ≈ 63.2% fraction of the distinct samples.
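The (1 - 1/e) figure is easy to verify by simulating one bootstrap draw (a small check I've added for illustration):

```python
import numpy as np

# Simulate one bootstrap: draw n indices with replacement from n rows,
# as RandomForestClassifier does when bootstrap=True.
rng = np.random.default_rng(0)
n = 100_000
idx = rng.integers(0, n, size=n)

# Fraction of distinct rows that appear at least once in the sample.
unique_frac = np.unique(idx).size / n
print(unique_frac)  # close to 1 - 1/e ≈ 0.632
```

Each row is missed by one draw with probability (1 - 1/n), so it is absent from all n draws with probability (1 - 1/n)^n → 1/e.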




Answer 2:


You can actually patch the _generate_sample_indices function in forest.py to change the subsample size. The fastai library implements a function set_rf_samples for exactly this purpose; it looks like this:

from sklearn.ensemble import forest  # sklearn.ensemble._forest in newer versions

def set_rf_samples(n):
    """ Changes Scikit learn's random forests to give each tree a random sample of
    n random rows.
    """
    # Monkey-patch the index generator so every tree draws n rows
    # (with replacement) instead of the full n_samples rows.
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n))
You could add this function to your own code and call it before fitting a RandomForestClassifier.
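Note that this monkey-patch targets a private API of older scikit-learn versions and may break on upgrades. Since scikit-learn 0.22, RandomForestClassifier accepts a max_samples parameter directly, which makes the patch unnecessary on recent versions; a minimal sketch with toy data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data purely for illustration.
X, y = make_classification(n_samples=200, random_state=0)

# max_samples=0.5: each tree's bootstrap sample uses half the rows
# (requires bootstrap=True, the default, and scikit-learn >= 0.22).
clf = RandomForestClassifier(n_estimators=100, max_samples=0.5, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

max_samples accepts either a float fraction of the dataset or an absolute integer row count.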



Source: https://stackoverflow.com/questions/40847745/subsample-size-in-scikit-learn-randomforestclassifier
