Using a confusion matrix as the scoring metric in cross-validation in scikit-learn

野性不改 2021-01-31 11:15

I am creating a pipeline in scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('bow', CountVectorizer()),
    ('classifier', BernoulliNB()),
])

and I would like to use the confusion matrix as the scoring metric when cross-validating this pipeline. Is that possible?

5 Answers
  • 2021-01-31 11:40

    The short answer is: you cannot.

    You need to understand the difference between cross_val_score and cross-validation as a model-selection method. cross_val_score, as the name suggests, works only on scores. A confusion matrix is not a score; it is a summary of what happened during evaluation. A major distinction is that a score is supposed to be an orderable object, in scikit-learn a float, so based on scores you can tell that method b is better than method a simply by checking whether b has the bigger score. You cannot do this with a confusion matrix, which, again as the name suggests, is a matrix.
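
    For instance, any scikit-learn scorer reduces an evaluation to a single comparable float:

    from sklearn.metrics import accuracy_score

    # A score is one float, so two models can be ranked directly.
    print(accuracy_score([0, 1, 1], [0, 1, 0]))  # 0.666...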

    If you want to obtain confusion matrices for multiple evaluation runs (such as cross-validation) you have to do it by hand, which is not that bad in scikit-learn; it is actually just a few lines of code.

    from sklearn.model_selection import KFold
    from sklearn.metrics import confusion_matrix

    kf = KFold(n_splits=5)
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Fit on the training folds, then report the confusion matrix
        # on the held-out fold.
        model.fit(X_train, y_train)
        print(confusion_matrix(y_test, model.predict(X_test)))
    
  • 2021-01-31 11:45

    What you can do, though, is define a scorer that uses specific values from the confusion matrix. See here [link]. Citing the code:

    from sklearn import svm
    from sklearn.metrics import confusion_matrix, make_scorer
    from sklearn.model_selection import cross_validate

    # scikit-learn's confusion_matrix layout is [[tn, fp], [fn, tp]].
    def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
    def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
    def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
    def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]

    scoring = {'tp': make_scorer(tp), 'tn': make_scorer(tn),
               'fp': make_scorer(fp), 'fn': make_scorer(fn)}
    cv_results = cross_validate(svm.SVC(kernel='linear'), X, y, scoring=scoring)
    

    This will perform the cross-validation for each of the four scorers and return the dictionary cv_results, whose keys test_tp, test_tn, test_fp and test_fn each hold an array with that confusion-matrix entry from every cross-validation split.

    From this you could reconstruct an average confusion matrix, as sketched below, although the cross_val_predict approach in Xema's answer seems more elegant for that.
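
    A minimal sketch of that reconstruction, assuming cv_results comes from the call above:

    import numpy as np

    # Average each cell over the folds and assemble the result in
    # scikit-learn's [[tn, fp], [fn, tp]] layout.
    mean_cm = np.array([
        [cv_results['test_tn'].mean(), cv_results['test_fp'].mean()],
        [cv_results['test_fn'].mean(), cv_results['test_tp'].mean()],
    ])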

    Note that this will not work with cross_val_score; you need cross_validate (introduced in scikit-learn 0.19).

    Side note: you could also use a single one of these scorers (i.e. one element of the matrix) for hyper-parameter optimization via grid search, as sketched below.
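
    A minimal sketch of that idea, reusing the pipeline from the question and the tp scorer defined above; the parameter grid is purely illustrative, and X and y are assumed to be the training data:

    from sklearn.model_selection import GridSearchCV

    # Tune the Naive Bayes smoothing parameter while maximizing the
    # number of true positives per validation fold.
    grid = GridSearchCV(pipeline,
                        param_grid={'classifier__alpha': [0.1, 0.5, 1.0]},
                        scoring=make_scorer(tp))
    grid.fit(X, y)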

    Note on indexing: scikit-learn's confusion_matrix returns true negatives at [0, 0] and true positives at [1, 1]; the scorers above follow that convention.

  • 2021-01-31 11:55

    I am new to machine learning. If I understand correctly, the confusion matrix is built from four values: TP, FN, FP and TN. These four values cannot be obtained directly from the scoring functions, but they are implied by accuracy, precision and recall.

    That gives four unknowns (TP, FN, FP and TN) and three equations:

    Eq. 1: tp / (tp + fp) = P

    Eq. 2: tp / (tp + fn) = R

    Eq. 3: (tp + tn) / (tp + fn + fp + tn) = A
    

    If we assume one of the unknowns equals 1 (below, tp = 1), we are left with three unknowns and three equations, so the values relative to tp can be solved as a system of equations.

    1. P, R and A can be obtained from the scoring functions.

    2. cross_validate can compute all three scores in one run (see the sketch after the function below).

    def calculate_confusion_matrix_by_assume_tp_equal_to_1(r, p, a):
        # Solve Eq. 1-3 for fn, fp and tn under the assumption tp = 1.
        # From R = tp / (tp + fn) with tp = 1:
        fn = (1 / r) - 1
        # From P = tp / (tp + fp) with tp = 1:
        fp = (1 / p) - 1
        # From A = (tp + tn) / (tp + fn + fp + tn), solved for tn:
        tn = (1 - a - a * fn - a * fp) / (a - 1)
        return fn, fp, tn
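
    A minimal usage sketch, assuming an estimator clf and data X, y are defined elsewhere:

    from sklearn.model_selection import cross_validate

    cv = cross_validate(clf, X, y,
                        scoring=('precision', 'recall', 'accuracy'))
    p = cv['test_precision'].mean()
    r = cv['test_recall'].mean()
    a = cv['test_accuracy'].mean()
    fn, fp, tn = calculate_confusion_matrix_by_assume_tp_equal_to_1(r, p, a)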
    
  • 2021-01-31 12:01

    You could use cross_val_predict (see the scikit-learn docs) instead of cross_val_score.

    Instead of doing:

    from sklearn.model_selection import cross_val_score
    scores = cross_val_score(clf, x, y, cv=10)
    

    you can do:

    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix
    y_pred = cross_val_predict(clf, x, y, cv=10)
    conf_mat = confusion_matrix(y, y_pred)
    
  • 2021-01-31 12:05

    I think what you really want is the average of the confusion matrices obtained from each cross-validation run. @lejlot already explained nicely why, so I'll just extend his answer with the calculation of the mean of the confusion matrices:

    Calculate a confusion matrix in each run of the cross-validation. You can use something like this:

    from sklearn.model_selection import KFold
    from sklearn.metrics import confusion_matrix

    conf_matrix_list_of_arrays = []
    kf = KFold(n_splits=5)
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Collect one confusion matrix per fold.
        model.fit(X_train, y_train)
        conf_matrix = confusion_matrix(y_test, model.predict(X_test))
        conf_matrix_list_of_arrays.append(conf_matrix)
    

    At the end you can calculate the mean of the list of numpy arrays (the confusion matrices) with:

    import numpy as np

    # Element-wise mean over the per-fold confusion matrices.
    mean_of_conf_matrix_arrays = np.mean(conf_matrix_list_of_arrays, axis=0)
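
    If you prefer raw pooled counts instead of a per-fold average, np.sum(conf_matrix_list_of_arrays, axis=0) adds the folds into a single aggregate confusion matrix.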
    