I am creating a pipeline in scikit-learn:

pipeline = Pipeline([
    ('bow', CountVectorizer()),
    ('classifier', BernoulliNB()),
])

How can I get a confusion matrix when evaluating it with cross-validation (e.g. cross_val_score)?
Short answer is: "you cannot".

You need to understand the difference between cross_val_score and cross-validation as a model selection method. cross_val_score, as the name suggests, works only on scores. A confusion matrix is not a score; it is a kind of summary of what happened during evaluation. A major distinction is that a score is supposed to be an orderable object, in scikit-learn a float, so based on scores you can tell whether method b is better than method a simply by checking whether b has the bigger score. You cannot do this with a confusion matrix which, again as the name suggests, is a matrix.
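To make the distinction concrete, a minimal sketch with toy labels:

from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

print(accuracy_score(y_true, y_pred))    # a single orderable float: 0.75
print(confusion_matrix(y_true, y_pred))  # a 2x2 array: [[2 0], [1 1]]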
If you want to obtain confusion matrices for multiple evaluation runs (such as cross-validation) you have to do it by hand, which is not that bad in scikit-learn; it is actually just a few lines of code.
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    print(confusion_matrix(y_test, model.predict(X_test)))
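Applied to the pipeline from the question, a minimal self-contained sketch (the toy texts and labels are hypothetical):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

texts = np.array(["good movie", "bad movie", "great film", "awful film"] * 5)
labels = np.array([1, 0, 1, 0] * 5)

pipeline = Pipeline([
    ('bow', CountVectorizer()),
    ('classifier', BernoulliNB()),
])

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in kf.split(texts):
    pipeline.fit(texts[train_index], labels[train_index])
    print(confusion_matrix(labels[test_index], pipeline.predict(texts[test_index])))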
What you can do, though, is define a scorer that uses certain values from the confusion matrix. See here [link]. Just citing the code:
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_validate

def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]

scoring = {'tp': make_scorer(tp), 'tn': make_scorer(tn),
           'fp': make_scorer(fp), 'fn': make_scorer(fn)}
cv_results = cross_validate(svm, X, y, scoring=scoring)
This will perform the cross-validation with each of these four scorers and return the dictionary cv_results, e.g. with keys test_tp, test_tn, etc., containing the confusion matrix values from each cross-validation split.
From this you could reconstruct an average confusion matrix, but the cross_val_predict approach in Xema's answer seems more elegant for this.
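If you do want to reconstruct it from these scorers, a minimal sketch (averaging the per-fold counts from cv_results above, in scikit-learn's [[tn, fp], [fn, tp]] layout):

import numpy as np

mean_conf_matrix = np.array([
    [cv_results['test_tn'].mean(), cv_results['test_fp'].mean()],
    [cv_results['test_fn'].mean(), cv_results['test_tp'].mean()],
])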
Note that this will not work with cross_val_score; you'll need cross_validate (introduced in scikit-learn v0.19).
Side note: you could use one of these scorers (i.e. one element of the matrix) for hyper-parameter optimization via grid search.
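For instance, a hedged sketch of how that could look (the SVC and the parameter grid are hypothetical; tp and make_scorer come from the snippet above):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

grid = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]},
                    scoring=make_scorer(tp), cv=5)
grid.fit(X, y)
print(grid.best_params_)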
*EDIT: in scikit-learn's convention (binary labels 0/1), true negatives are returned at [0, 0] and true positives at [1, 1]; the definitions above use that layout.
I am new to machine learning. If I understand correctly, the confusion matrix can be obtained from four values: TP, FN, FP and TN. Those four values cannot be obtained directly from scoring, but they are implied by accuracy, precision and recall.
Now there are four unknowns (TP, FN, FP and TN) and three equations:

Eq1: tp / (tp + fp) = P
Eq2: tp / (tp + fn) = R
Eq3: (tp + tn) / (tp + fn + fp + tn) = A
Assuming one of the unknowns equals 1 (say tp = 1), we are left with three unknowns and three equations, so the relative values can be solved as a system of equations. P, R and A can be obtained from scoring, and cross_validate can compute all three scores in one call:
def calculate_confusion_matrix_by_assuming_tp_equal_to_1(r, p, a):
    # From tp / (tp + fn) = R with tp = 1:  fn = 1/R - 1
    fn = (1 / r) - 1
    # From tp / (tp + fp) = P with tp = 1:  fp = 1/P - 1
    fp = (1 / p) - 1
    # From (tp + tn) / (tp + fn + fp + tn) = A, solved for tn
    tn = (1 - a - a * fn - a * fp) / (a - 1)
    return fn, fp, tn
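A hedged usage sketch (clf, X and y are assumed to be defined; the mean scores over the folds are fed into the function above):

from sklearn.model_selection import cross_validate

scores = cross_validate(clf, X, y, cv=5,
                        scoring=['precision', 'recall', 'accuracy'])
fn, fp, tn = calculate_confusion_matrix_by_assuming_tp_equal_to_1(
    scores['test_recall'].mean(),
    scores['test_precision'].mean(),
    scores['test_accuracy'].mean(),
)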
You could use cross_val_predict (see the scikit-learn docs) instead of cross_val_score.

Instead of doing:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, x, y, cv=10)
you can do:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
y_pred = cross_val_predict(clf, x, y, cv=10)
conf_mat = confusion_matrix(y, y_pred)
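If you'd rather see per-class rates than raw counts, recent scikit-learn versions also support normalization (a small follow-up sketch, reusing y and y_pred from above):

conf_mat_normalized = confusion_matrix(y, y_pred, normalize='true')  # each row sums to 1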
I think what you really want is the average of the confusion matrices obtained from each cross-validation run. @lejlot already nicely explained why; I'll just extend his answer with the calculation of the mean of the confusion matrices:

Calculate the confusion matrix in each cross-validation run. You can use something like this:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

conf_matrix_list_of_arrays = []
kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    conf_matrix = confusion_matrix(y_test, model.predict(X_test))
    conf_matrix_list_of_arrays.append(conf_matrix)
At the end you can calculate the mean of the list of numpy arrays (confusion matrices) with:
mean_of_conf_matrix_arrays = np.mean(conf_matrix_list_of_arrays, axis=0)
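A side note: since each sample appears in exactly one test fold, summing instead of averaging gives the pooled confusion matrix over all folds, which matches what the cross_val_predict approach produces:

sum_of_conf_matrix_arrays = np.sum(conf_matrix_list_of_arrays, axis=0)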