I am creating a pipeline in scikit-learn:

pipeline = Pipeline([
    ('bow', CountVectorizer()),
    ('classifier', BernoulliNB()),
])
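A minimal runnable sketch of fitting such a pipeline, using made-up toy documents and labels purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('bow', CountVectorizer()),     # bag-of-words features
    ('classifier', BernoulliNB()),  # naive Bayes on binary word presence
])

# Hypothetical toy corpus: class 1 = "positive" words, class 0 = "negative" words.
docs = ["good great fine", "bad awful", "great good", "awful bad terrible"]
labels = [1, 0, 1, 0]

pipeline.fit(docs, labels)
print(pipeline.predict(["good fine", "bad terrible"]))
```

The pipeline handles the vectorization internally, so you pass raw strings to `fit` and `predict`.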
What you can do, though, is define scorers that each extract one value from the confusion matrix. See here [link]. Citing the code:
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_validate

def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]

scoring = {'tp': make_scorer(tp), 'tn': make_scorer(tn),
           'fp': make_scorer(fp), 'fn': make_scorer(fn)}

cv_results = cross_validate(svm, X, y, scoring=scoring)
This will perform the cross-validation with each of these four scorers and return the scoring dictionary cv_results, with keys such as test_tp and test_tn containing that confusion-matrix entry for each cross-validation split.
From this you could reconstruct an average confusion matrix, but the cross_val_predict approach of Xema seems more elegant for this.
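Both routes can be sketched side by side. The toy data and the LinearSVC estimator below are just illustrative stand-ins; any binary classifier would do:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_val_predict, cross_validate
from sklearn.svm import LinearSVC

# Scorers for the four confusion-matrix cells (sklearn convention: TN at [0,0]).
def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]

scoring = {'tp': make_scorer(tp), 'tn': make_scorer(tn),
           'fp': make_scorer(fp), 'fn': make_scorer(fn)}

X, y = make_classification(n_samples=100, random_state=0)  # toy binary data
clf = LinearSVC(random_state=0)

cv_results = cross_validate(clf, X, y, cv=5, scoring=scoring)

# Route 1: sum the per-split counts back into one overall 2x2 matrix.
summed = np.array([[cv_results['test_tn'].sum(), cv_results['test_fp'].sum()],
                   [cv_results['test_fn'].sum(), cv_results['test_tp'].sum()]])

# Route 2: one out-of-fold prediction per sample, then a single call.
cm = confusion_matrix(y, cross_val_predict(clf, X, y, cv=5))

print(summed)
print(cm)
```

With the same cv on both calls the two matrices agree, since each sample is predicted exactly once out-of-fold either way.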
Note that this will actually not work with cross_val_score; you'll need cross_validate (introduced in scikit-learn v0.19).
Side note: you could use one of these scorers (i.e. one element of the matrix) for hyper-parameter optimization via grid search.
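For instance, a grid search could minimize false positives directly. The data, estimator, and parameter grid here are hypothetical placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]

X, y = make_classification(n_samples=100, random_state=0)  # toy binary data

# greater_is_better=False because we want to *minimize* false positives;
# sklearn then negates the score so that higher is still better internally.
grid = GridSearchCV(LinearSVC(random_state=0),
                    param_grid={'C': [0.01, 0.1, 1.0]},
                    scoring=make_scorer(fp, greater_is_better=False),
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)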
*EDIT: fixed the indices: in scikit-learn's confusion_matrix, true negatives are at [0, 0], false positives at [0, 1], false negatives at [1, 0], and true positives at [1, 1].