Using a confusion matrix as the scoring metric in cross-validation in scikit-learn

野性不改 2021-01-31 11:15

I am creating a pipeline in scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('bow', CountVectorizer()),
    ('classifier', BernoulliNB()),
])

and I would like to use the confusion matrix as the scoring metric when cross-validating this pipeline. Is that possible?

5 Answers
  • 2021-01-31 11:40

    The short answer is: you cannot.

    You need to understand the difference between cross_val_score and cross-validation as a model-selection method. cross_val_score, as the name suggests, works only on scores. A confusion matrix is not a score; it is a summary of what happened during evaluation. A major distinction is that a score is supposed to be an orderable object, in scikit-learn a float, so based on scores you can tell that method b is better than method a simply by checking whether b has the bigger score. You cannot do this with a confusion matrix, which, again as the name suggests, is a matrix.
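
    For instance, any scikit-learn scorer reduces an evaluation to a single comparable float:

    from sklearn.metrics import accuracy_score

    # A score is one float, so two models can be ranked directly.
    print(accuracy_score([0, 1, 1], [0, 1, 0]))  # 0.666...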

    If you want to obtain confusion matrices for multiple evaluation runs (such as cross-validation) you have to do it by hand, which is not that bad in scikit-learn; it is actually just a few lines of code.

    from sklearn.model_selection import KFold
    from sklearn.metrics import confusion_matrix

    kf = KFold(n_splits=5)
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Fit on the training folds, then report the confusion matrix
        # on the held-out fold.
        model.fit(X_train, y_train)
        print(confusion_matrix(y_test, model.predict(X_test)))
    
  • 2021-01-31 11:45

    What you can do, though, is define a scorer that uses specific values from the confusion matrix. See here [link]. Citing the code:

    from sklearn import svm
    from sklearn.metrics import confusion_matrix, make_scorer
    from sklearn.model_selection import cross_validate

    # scikit-learn's confusion_matrix layout is [[tn, fp], [fn, tp]].
    def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
    def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
    def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
    def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]

    scoring = {'tp': make_scorer(tp), 'tn': make_scorer(tn),
               'fp': make_scorer(fp), 'fn': make_scorer(fn)}
    cv_results = cross_validate(svm.SVC(kernel='linear'), X, y, scoring=scoring)
    

    This will perform the cross-validation for each of the four scorers and return the dictionary cv_results, whose keys test_tp, test_tn, test_fp and test_fn each hold an array with that confusion-matrix entry from every cross-validation split.

    From this you could reconstruct an average confusion matrix, as sketched below, although the cross_val_predict approach in Xema's answer seems more elegant for that.
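
    A minimal sketch of that reconstruction, assuming cv_results comes from the call above:

    import numpy as np

    # Average each cell over the folds and assemble the result in
    # scikit-learn's [[tn, fp], [fn, tp]] layout.
    mean_cm = np.array([
        [cv_results['test_tn'].mean(), cv_results['test_fp'].mean()],
        [cv_results['test_fn'].mean(), cv_results['test_tp'].mean()],
    ])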

    Note that this will not work with cross_val_score; you need cross_validate (introduced in scikit-learn 0.19).

    Side note: you could also use a single one of these scorers (i.e. one element of the matrix) for hyper-parameter optimization via grid search, as sketched below.
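
    A minimal sketch of that idea, reusing the pipeline from the question and the tp scorer defined above; the parameter grid is purely illustrative, and X and y are assumed to be the training data:

    from sklearn.model_selection import GridSearchCV

    # Tune the Naive Bayes smoothing parameter while maximizing the
    # number of true positives per validation fold.
    grid = GridSearchCV(pipeline,
                        param_grid={'classifier__alpha': [0.1, 0.5, 1.0]},
                        scoring=make_scorer(tp))
    grid.fit(X, y)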

    Note on indexing: scikit-learn's confusion_matrix returns true negatives at [0, 0] and true positives at [1, 1]; the scorers above follow that convention.

  • 2021-01-31 11:55

    I am new to machine learning. If I understand correctly, the confusion matrix is built from four values: TP, FN, FP and TN. These four values cannot be obtained directly from the scoring functions, but they are implied by accuracy, precision and recall.

    That gives four unknowns (TP, FN, FP and TN) and three equations:

    Eq. 1: tp / (tp + fp) = P

    Eq. 2: tp / (tp + fn) = R

    Eq. 3: (tp + tn) / (tp + fn + fp + tn) = A
    

    If we assume one of the unknowns equals 1 (below, tp = 1), we are left with three unknowns and three equations, so the values relative to tp can be solved as a system of equations.

    1. P, R and A can be obtained from the scoring functions.

    2. cross_validate can compute all three scores in one run (see the sketch after the function below).

    def calculate_confusion_matrix_by_assume_tp_equal_to_1(r, p, a):
        # Solve Eq. 1-3 for fn, fp and tn under the assumption tp = 1.
        # From R = tp / (tp + fn) with tp = 1:
        fn = (1 / r) - 1
        # From P = tp / (tp + fp) with tp = 1:
        fp = (1 / p) - 1
        # From A = (tp + tn) / (tp + fn + fp + tn), solved for tn:
        tn = (1 - a - a * fn - a * fp) / (a - 1)
        return fn, fp, tn
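
    A minimal usage sketch, assuming an estimator clf and data X, y are defined elsewhere:

    from sklearn.model_selection import cross_validate

    cv = cross_validate(clf, X, y,
                        scoring=('precision', 'recall', 'accuracy'))
    p = cv['test_precision'].mean()
    r = cv['test_recall'].mean()
    a = cv['test_accuracy'].mean()
    fn, fp, tn = calculate_confusion_matrix_by_assume_tp_equal_to_1(r, p, a)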
    
  • 2021-01-31 12:01

    You could use cross_val_predict (see the scikit-learn docs) instead of cross_val_score.

    Instead of doing:

    from sklearn.model_selection import cross_val_score
    scores = cross_val_score(clf, x, y, cv=10)
    

    you can do:

    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix
    y_pred = cross_val_predict(clf, x, y, cv=10)
    conf_mat = confusion_matrix(y, y_pred)
    
  • 2021-01-31 12:05

    I think what you really want is the average of the confusion matrices obtained from each cross-validation run. @lejlot already explained nicely why, so I'll just extend his answer with the calculation of the mean of the confusion matrices:

    Calculate a confusion matrix in each run of the cross-validation. You can use something like this:

    from sklearn.model_selection import KFold
    from sklearn.metrics import confusion_matrix

    conf_matrix_list_of_arrays = []
    kf = KFold(n_splits=5)
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Collect one confusion matrix per fold.
        model.fit(X_train, y_train)
        conf_matrix = confusion_matrix(y_test, model.predict(X_test))
        conf_matrix_list_of_arrays.append(conf_matrix)
    

    At the end you can calculate the mean of the list of numpy arrays (the confusion matrices) with:

    import numpy as np

    # Element-wise mean over the per-fold confusion matrices.
    mean_of_conf_matrix_arrays = np.mean(conf_matrix_list_of_arrays, axis=0)
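
    If you prefer raw pooled counts instead of a per-fold average, np.sum(conf_matrix_list_of_arrays, axis=0) adds the folds into a single aggregate confusion matrix.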
    