Best way to combine probabilistic classifiers in scikit-learn

误落风尘 2021-01-30 11:46

I have a logistic regression and a random forest, and I'd like to combine them into an ensemble by averaging their predicted class probabilities for the final classification.

Is t

4 Answers
  • 2021-01-30 12:00

    NOTE: scikit-learn's VotingClassifier is probably the best way to do this now.


    OLD ANSWER:

    For what it's worth I ended up doing this as follows:

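    import numpy as np
    from sklearn.base import BaseEstimator, ClassifierMixin

    class EnsembleClassifier(BaseEstimator, ClassifierMixin):
        def __init__(self, classifiers=None):
            self.classifiers = classifiers

        def fit(self, X, y):
            # Fit every base classifier on the same training data.
            for classifier in self.classifiers:
                classifier.fit(X, y)
            self.classes_ = self.classifiers[0].classes_
            return self

        def predict_proba(self, X):
            # Average the predicted class probabilities across classifiers.
            self.predictions_ = [clf.predict_proba(X) for clf in self.classifiers]
            return np.mean(self.predictions_, axis=0)

        def predict(self, X):
            # Pick the class with the highest averaged probability.
            return self.classes_[np.argmax(self.predict_proba(X), axis=1)]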
    
  • 2021-01-30 12:01

    Now scikit-learn has StackingClassifier which can be used to stack multiple estimators.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # Base estimators whose predictions feed the final estimator.
    estimators = [
        ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
        ('lg', LogisticRegression()),
    ]
    # The final estimator is trained on the base estimators' predictions.
    clf = StackingClassifier(
        estimators=estimators, final_estimator=LogisticRegression()
    )

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=42
    )
    clf.fit(X_train, y_train)
    clf.predict_proba(X_test)
    
  • 2021-01-30 12:09

    Given the same problem, I used majority voting. Combining probabilities or scores arbitrarily is problematic because the classifiers can perform very differently (for example, an SVM with two different kernels, plus a random forest, plus another classifier trained on a different training set).

    One possible way to weight the different classifiers is to use their Jaccard score as the weight. Be warned, though: as I understand it, the scores from different models are not all created equal. A gradient boosting classifier in my ensemble gives all its scores as 0.97, 0.98, 1.00 or 0.41/0, i.e. it is very overconfident.
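
The weighting idea above could be sketched roughly as follows. This is only an illustration under assumptions that are not in the answer: the iris data, the two models, the three-way split, and the choice of the macro-averaged Jaccard score on a held-out validation set as the weight are all made up for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import jaccard_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Assumed split: train on one half, then use the rest for weighting and testing.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

# Illustrative model choice; any classifiers with predict_proba would do.
models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=0),
]
for model in models:
    model.fit(X_train, y_train)

# One weight per model: its macro-averaged Jaccard score on the validation set.
weights = np.array(
    [jaccard_score(y_val, model.predict(X_val), average="macro") for model in models]
)
weights = weights / weights.sum()  # normalize so the weights sum to 1

# Stack to shape (n_models, n_samples, n_classes), then average over the models.
probas = np.stack([model.predict_proba(X_test) for model in models])
weighted_proba = np.average(probas, axis=0, weights=weights)
```

Weighting does not fix overconfident scores by itself. If one model's probabilities are badly calibrated, wrapping it in `sklearn.calibration.CalibratedClassifierCV` before averaging may help, though whether it does depends on the data.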

  • 2021-01-30 12:16

    What about the sklearn.ensemble.VotingClassifier?

    http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier

    Per the description:

    The idea behind the voting classifier implementation is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing model in order to balance out their individual weaknesses.
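
For the logistic regression plus random forest case from the question, soft voting might look like the sketch below. The dataset, split, and hyperparameters are assumptions chosen only to make the example runnable. With `voting="soft"` and no `weights`, `predict_proba` returns the plain average of the base models' probabilities, which the last lines check by hand.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Soft voting averages the predicted class probabilities of the base models.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
avg_proba = ensemble.predict_proba(X_test)

# Sanity check: recompute the average from the fitted base estimators.
manual_proba = np.mean(
    [est.predict_proba(X_test) for est in ensemble.estimators_], axis=0
)
print(np.allclose(avg_proba, manual_proba))
```

Passing `weights=[...]` to `VotingClassifier` turns this into a weighted average, should one model deserve more say than the other.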
