Scikit-learn: changing the threshold to create multiple confusion matrices

悲&欢浪女 2021-02-09 01:26

I'm building a classifier that goes through Lending Club data, and selects the best X loans. I've trained a Random Forest, and created the usual ROC curves, Confusion Matrices

1 answer
  •  名媛妹妹
    2021-02-09 02:05

    A. In your case, changing the threshold is admissible and maybe even necessary. The default threshold is 50%, but from a business point of view even a 15% probability of non-repayment might be enough to reject an application.

    In fact, in credit scoring it is common to set different cut-offs for different product terms or customer segments, after predicting probability of default with a common model (see e.g. chapter 9 of "Credit Risk Scorecards" by Naeem Siddiqi).
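
As a rough illustration of segment-specific cut-offs, the sketch below applies a different rejection threshold per product term to probabilities that are assumed to come from one shared model. The probabilities, the term labels ("36m", "60m"), and the cut-off values are all made-up placeholders, not figures from the book or from Lending Club.

```python
import numpy as np

# Hypothetical predicted probabilities of default from a single shared model.
p_default = np.array([0.05, 0.20, 0.12, 0.40, 0.08])
# Hypothetical product term for each applicant.
term = np.array(["36m", "36m", "60m", "60m", "36m"])
# Hypothetical cut-offs: reject when p_default exceeds the term's value.
cutoffs = {"36m": 0.15, "60m": 0.10}

# Look up each applicant's cut-off, then compare element-wise.
applicant_cutoff = np.array([cutoffs[t] for t in term])
reject = p_default > applicant_cutoff
print(reject)  # [False  True  True  True False]
```

The third applicant (12%) is rejected under the stricter 60-month cut-off but would have passed the 36-month one, which is the point of keeping the model common and varying only the threshold.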

    B. There are two convenient ways to threshold at an arbitrary alpha instead of 50%:

    1. Call predict_proba and threshold its output at alpha manually, or with a wrapper class (see the code below). Use this if you want to try multiple thresholds without refitting the model.
    2. Set class_weight to {0: alpha, 1: 1 - alpha} before fitting the model.
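
The second option might look like the sketch below, which passes a class_weight dictionary to RandomForestClassifier. The synthetic data and the value alpha = 0.3 are assumptions for illustration. Because the weights change how the trees are grown, the resulting predictions need not match thresholding an unweighted model at alpha exactly; treat it as an approximation that requires a refit for every alpha.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; substitute the real loan features).
X, y = make_classification(random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

alpha = 0.3  # illustrative value, not a recommendation

# Weight class 0 by alpha and class 1 by 1 - alpha, then fit as usual.
weighted_rf = RandomForestClassifier(
    class_weight={0: alpha, 1: 1 - alpha}, random_state=1
).fit(X_train, y_train)

# predict() still uses the default decision rule; only the training changed.
cm = confusion_matrix(y_test, weighted_rf.predict(X_test), labels=[0, 1])
print(cm)
```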

    Sample code for the wrapper:

    from sklearn.base import BaseEstimator, ClassifierMixin
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    # Synthetic binary classification data for the demo
    X, y = make_classification(random_state=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    class CustomThreshold(BaseEstimator, ClassifierMixin):
        """Custom threshold wrapper for binary classification."""
        def __init__(self, base, threshold=0.5):
            self.base = base
            self.threshold = threshold

        def fit(self, *args, **kwargs):
            self.base.fit(*args, **kwargs)
            return self

        def predict(self, X):
            # Predict class 1 when its probability exceeds the threshold
            return (self.base.predict_proba(X)[:, 1] > self.threshold).astype(int)

    # Fit once, then wrap the same fitted model with several thresholds
    rf = RandomForestClassifier(random_state=1).fit(X_train, y_train)
    clf = [CustomThreshold(rf, threshold) for threshold in [0.3, 0.5, 0.7]]

    for model in clf:
        print(confusion_matrix(y_test, model.predict(X_test)))

    # A threshold of 0.5 reproduces the base model's own predictions
    assert (clf[1].predict(X_test) == clf[1].base.predict(X_test)).all()
    # A lower threshold yields more positives, a higher one fewer
    assert sum(clf[0].predict(X_test)) > sum(clf[0].base.predict(X_test))
    assert sum(clf[2].predict(X_test)) < sum(clf[2].base.predict(X_test))
    

    This prints three confusion matrices, one per threshold (0.3, 0.5, 0.7):

    [[13  1]
     [ 2  9]]
    [[14  0]
     [ 3  8]]
    [[14  0]
     [ 4  7]]
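
When sweeping many thresholds, it may be cheaper to call predict_proba once and reuse the scores rather than re-scoring through each wrapper. The following is a minimal sketch of that variant; the helper name sweep_thresholds and the list of thresholds are illustrative assumptions, and the data is the same synthetic set rather than real loan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
rf = RandomForestClassifier(random_state=1).fit(X_train, y_train)

# Score the test set once; column 1 is the probability of class 1.
scores = rf.predict_proba(X_test)[:, 1]

def sweep_thresholds(y_true, scores, thresholds):
    """Map each threshold to the confusion matrix it produces (hypothetical helper)."""
    return {
        t: confusion_matrix(y_true, (scores > t).astype(int), labels=[0, 1])
        for t in thresholds
    }

matrices = sweep_thresholds(y_test, scores, [0.1, 0.3, 0.5, 0.7, 0.9])
for t, cm in matrices.items():
    # ravel() on a 2x2 matrix yields TN, FP, FN, TP in that order.
    tn, fp, fn, tp = cm.ravel()
    print(f"threshold={t:.1f}  TN={tn} FP={fp} FN={fn} TP={tp}")
```

Passing labels=[0, 1] keeps every matrix 2x2 even when an extreme threshold leaves one predicted class empty.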
    
