Feature importances - Bagging, scikit-learn

忘了有多久 2021-01-04 21:14

For a project I am comparing a number of decision tree ensembles, using the regression algorithms (Random Forest, Extra Trees, AdaBoost and Bagging) of scikit-learn. To compare and i…

2 Answers
  • 2021-01-04 22:00

    Are you talking about BaggingClassifier? It can be used with many base estimators, so no feature importances are implemented for it. There are model-independent methods for computing feature importances (see e.g. https://github.com/scikit-learn/scikit-learn/issues/8898), but scikit-learn doesn't use them (a sketch of one such method is shown below).
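
    One model-independent option is permutation importance, which newer scikit-learn releases (0.22+) expose as sklearn.inspection.permutation_importance. A minimal sketch, assuming such a version is available:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import BaggingClassifier
    from sklearn.inspection import permutation_importance

    X, y = load_iris(return_X_y=True)
    clf = BaggingClassifier().fit(X, y)
    # shuffle each feature column in turn and measure the resulting drop in score
    result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
    print(result.importances_mean)  # one averaged importance per feature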

    In case of decision trees as base estimators you can compute the feature importances yourself: it is just the average of tree.feature_importances_ over all trees in bagging.estimators_:

    import numpy as np
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import load_iris
    
    X, y = load_iris(return_X_y=True)
    clf = BaggingClassifier(DecisionTreeClassifier())
    clf.fit(X, y)
    
    # average the per-tree importances across all trees in the ensemble
    feature_importances = np.mean([
        tree.feature_importances_ for tree in clf.estimators_
    ], axis=0)
    

    RandomForestClassifier does the same computation internally.
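
    For comparison, a minimal sketch of the built-in attribute on a random forest (standard scikit-learn API):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    rf = RandomForestClassifier().fit(X, y)
    # equivalent to averaging tree.feature_importances_ over rf.estimators_
    print(rf.feature_importances_)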

  • 2021-01-04 22:03

    I encountered the same problem, and the average feature importance was what I was interested in. Furthermore, I needed a feature_importances_ attribute exposed by (i.e. accessible from) the bagging classifier object, so that it could be used by another scikit-learn algorithm (e.g. RFE with an ROC_AUC scorer).

    I chose to subclass BaggingClassifier and override its fit method, to gain direct access to the mean feature_importances_ (or coef_) of the base estimators.

    Here is how to do so:

    import numpy as np
    from sklearn.ensemble import BaggingClassifier

    class BaggingClassifierCoefs(BaggingClassifier):
        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            # add the attribute of interest
            self.feature_importances_ = None

        def fit(self, X, y, sample_weight=None):
            # overload fit to compute the feature importances after fitting
            fitted = self._fit(X, y, self.max_samples, sample_weight=sample_weight)  # private fit method
            if hasattr(fitted.estimators_[0], 'feature_importances_'):
                # tree-like base estimators expose feature_importances_
                self.feature_importances_ = np.mean(
                    [tree.feature_importances_ for tree in fitted.estimators_], axis=0)
            else:
                # linear base estimators expose coef_ instead
                self.feature_importances_ = np.mean(
                    [tree.coef_ for tree in fitted.estimators_], axis=0)
            return fitted
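
    A minimal usage sketch, assuming a scikit-learn version contemporary with this answer (one that still accepts the base_estimator keyword and exposes the private _fit method used above):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = BaggingClassifierCoefs(base_estimator=DecisionTreeClassifier(), n_estimators=10)
    clf.fit(X, y)
    print(clf.feature_importances_)  # averaged over the 10 bagged trees

    One caveat worth checking: because __init__ only accepts **kwargs, scikit-learn's get_params/clone machinery may not preserve constructor arguments, which matters when the wrapper is cloned internally by RFE or cross-validation utilities.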
    