Using a confusion matrix as the scoring metric in cross-validation in scikit-learn

野性不改 2021-01-31 11:15

I am creating a pipeline in scikit-learn:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

pipeline = Pipeline([
    ('bow', CountVectorizer()),       # turn raw text into bag-of-words counts
    ('classifier', BernoulliNB()),    # Bernoulli Naive Bayes on those counts
])

and I would like to evaluate it with cross-validation, using the confusion matrix as the scoring metric. How can I do that?

5 Answers
北荒 2021-01-31 11:45

    A scorer must return a single number, and a confusion matrix is not one, so it cannot serve as a scoring metric directly. What you can do, though, is define one scorer per entry of the confusion matrix. See here [link]. Citing the code:

    from sklearn.metrics import confusion_matrix, make_scorer
    from sklearn.model_selection import cross_validate

    # scikit-learn's convention: C[i, j] counts samples with true label i
    # that were predicted as label j, so for binary labels [0, 1]:
    def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
    def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
    def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
    def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]

    scoring = {'tp': make_scorer(tp), 'tn': make_scorer(tn),
               'fp': make_scorer(fp), 'fn': make_scorer(fn)}
    cv_results = cross_validate(svm, X, y, scoring=scoring)  # svm: any estimator instance, e.g. SVC()
    

    This runs the cross-validation once and evaluates all four scorers on each split, returning the dictionary cv_results with keys test_tp, test_tn, test_fp and test_fn, each holding that confusion-matrix entry for every cross-validation split.

    From these you could reconstruct an average confusion matrix, although the cross_val_predict approach in Xema's answer seems more elegant for that; both are sketched below.
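    A minimal sketch of both options, assuming the cv_results and svm from the snippet above plus NumPy (all names illustrative):

    import numpy as np
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import cross_val_predict

    # Option 1: average the per-fold entries back into one 2x2 matrix
    # (each test_* key holds one count per split).
    mean_cm = np.array([
        [cv_results['test_tn'].mean(), cv_results['test_fp'].mean()],
        [cv_results['test_fn'].mean(), cv_results['test_tp'].mean()],
    ])

    # Option 2 (Xema's approach): pool all out-of-fold predictions and
    # build a single confusion matrix from them.
    y_pred = cross_val_predict(svm, X, y)
    pooled_cm = confusion_matrix(y, y_pred)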

    Note that this will not work with cross_val_score, which accepts only a single scorer; the dictionary of four scorers needs cross_validate (introduced in scikit-learn v0.19).
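    To illustrate the distinction (a sketch, reusing the tp scorer defined above):

    from sklearn.model_selection import cross_val_score

    # Fine: one confusion-matrix entry as a single metric.
    tp_per_fold = cross_val_score(svm, X, y, scoring=make_scorer(tp))

    # Not possible: cross_val_score has no multi-metric interface, so
    # scoring={'tp': ..., 'fn': ...} has to go through cross_validate.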

    Side note: you could use one of these scorers (i.e. one element of the matrix) for hyper-parameter optimization via grid search.
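    A hedged sketch of that idea, assuming the question's pipeline and a purely illustrative grid over the Naive Bayes smoothing parameter; GridSearchCV maximizes its score, so greater_is_better=False turns "few false positives" into the optimization target:

    from sklearn.model_selection import GridSearchCV

    # Negate fp so that fewer false positives means a higher score.
    fp_scorer = make_scorer(fp, greater_is_better=False)

    grid = GridSearchCV(pipeline,
                        param_grid={'classifier__alpha': [0.1, 0.5, 1.0]},  # illustrative values
                        scoring=fp_scorer)
    grid.fit(X, y)
    print(grid.best_params_)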

    *EDIT: fixed the indices above; with scikit-learn's convention, true negatives are at [0, 0] and true positives at [1, 1].
