Scikit-learn confusion matrix

情书的邮戳 2021-02-01 04:15

I can't figure out if I've set up my binary classification problem correctly. I labeled the positive class 1 and the negative class 0. However, it is my understanding that by default

4 Answers
  • 2021-02-01 04:34

    scikit-learn sorts labels in ascending order, so the 0s occupy the first row/column and the 1s the second:

    >>> from sklearn.metrics import confusion_matrix as cm
    >>> y_test = [1, 0, 0]
    >>> y_pred = [1, 0, 0]
    >>> cm(y_test, y_pred)
    array([[2, 0],
           [0, 1]])
    >>> y_pred = [4, 0, 0]
    >>> y_test = [4, 0, 0]
    >>> cm(y_test, y_pred)
    array([[2, 0],
           [0, 1]])
    >>> y_test = [-2, 0, 0]
    >>> y_pred = [-2, 0, 0]
    >>> cm(y_test, y_pred)
    array([[1, 0],
           [0, 2]])
    

    This is written in the docs:

    labels : array, shape = [n_classes], optional
        List of labels to index the matrix. This may be used to reorder or select a subset of labels. If none is given, those that appear at least once in y_true or y_pred are used in sorted order.

    So you can change this behavior by passing labels to the confusion_matrix call:

    >>> y_test = [1, 0, 0]
    >>> y_pred = [1, 0, 0]
    >>> cm(y_test, y_pred)
    array([[2, 0],
           [0, 1]])
    >>> cm(y_test, y_pred, labels=[1, 0])
    array([[1, 0],
           [0, 2]])
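
The docs passage quoted above also mentions selecting a subset of labels. A minimal sketch of that, using made-up three-class data (the values are an assumption for illustration, not from the question):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical three-class data, used only to illustrate the `labels` argument.
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 2, 2, 1, 0]

# Default: every label that occurs, in sorted order (0, 1, 2).
full = confusion_matrix(y_true, y_pred)

# Reordered subset: rows and columns follow [2, 0]; class 1 is left out,
# so samples whose true or predicted label is 1 are not counted.
subset = confusion_matrix(y_true, y_pred, labels=[2, 0])

print(full)
print(subset)
```

Here `subset` should come out as a 2x2 matrix whose first row and column belong to class 2, because that is the order given in `labels`.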
    

    The actual/predicted axes are ordered just like in your images: predictions are in columns and actual values are in rows.

    >>> y_test = [5, 5, 5, 0, 0, 0]
    >>> y_pred = [5, 0, 0, 0, 0, 0]
    >>> cm(y_test, y_pred)
    array([[3, 0],
           [2, 1]])
    
    • true: 0, predicted: 0 (value: 3, position [0, 0])
    • true: 5, predicted: 0 (value: 2, position [1, 0])
    • true: 0, predicted: 5 (value: 0, position [0, 1])
    • true: 5, predicted: 5 (value: 1, position [1, 1])
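
One way to avoid having to remember the orientation is to attach names to the rows and columns. The sketch below wraps the same matrix in a pandas DataFrame; the row and column names ("true ...", "predicted ...") are my own choice, not something scikit-learn produces:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

y_test = [5, 5, 5, 0, 0, 0]
y_pred = [5, 0, 0, 0, 0, 0]
labels = [0, 5]

# Passing labels explicitly fixes the row/column order.
matrix = confusion_matrix(y_test, y_pred, labels=labels)

# Name the axes explicitly: rows are actual classes, columns are predictions.
table = pd.DataFrame(
    matrix,
    index=[f"true {label}" for label in labels],
    columns=[f"predicted {label}" for label in labels],
)
print(table)
```

A cell can then be looked up by name, for example `table.loc["true 5", "predicted 0"]` for the samples that were actually 5 but predicted as 0.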
  • 2021-02-01 04:38

    Short answer: in binary classification, when using the labels argument,

    confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0], labels=[0,1]).ravel()
    

    the class labels 0 and 1 are treated as Negative and Positive, respectively. This follows from the order given in the list, not from alphanumeric order.
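
As a small sketch of what that ordering means in practice, the flattened matrix can be unpacked into named counts. The variable names with a `2` suffix are only there to keep the two calls apart:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1]
y_pred = [1, 1, 1, 0]

# With labels=[0, 1], the flattened matrix reads tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)

# With labels=[1, 0], the same counts come out in the order tp, fn, fp, tn.
tp2, fn2, fp2, tn2 = confusion_matrix(y_true, y_pred, labels=[1, 0]).ravel()
print(tp2, fn2, fp2, tn2)
```

Both calls count the same four cells; only the position of each cell changes with the order of `labels`.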


    Verification: consider imbalanced class labels like these (the imbalance makes the cells easier to tell apart):

    >>> y_true = [0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0]
    >>> y_pred = [0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0]
    >>> table = confusion_matrix(y_true, y_pred, labels=[0,1]).ravel()
    

    This gives the following flattened confusion matrix:

    >>> table
    array([12,  1,  2,  1])
    

    which corresponds to:

                  Actual
            |   1   |   0  |
         ___________________
    pred  1 |  TP=1 | FP=1 |
          0 |  FN=2 | TN=12|
    

    where FN=2 means that there were 2 cases where the model predicted the sample to be negative (i.e., 0) but the actual label was positive (i.e., 1), hence False Negative equals 2.

    Similarly for TN=12, in 12 cases the model correctly predicted the negative class (0), hence True Negative equals 12.

    This way everything adds up, assuming that sklearn treats the first label in labels=[0,1] as the negative class. Therefore 0, the first label, represents the negative class here.
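
A further cross-check, sketched under the assumption that 1 is the positive class: precision and recall computed by hand from the unpacked cells should agree with `precision_score` and `recall_score`, which use `pos_label=1` by default.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# Read the flattened matrix as tn, fp, fn, tp (labels=[0, 1]).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Metrics computed by hand from that reading.
precision_by_hand = tp / (tp + fp)
recall_by_hand = tp / (tp + fn)

# Metrics from scikit-learn, with 1 as the positive class (the default).
precision_sklearn = precision_score(y_true, y_pred)
recall_sklearn = recall_score(y_true, y_pred)

print(precision_by_hand, precision_sklearn)
print(recall_by_hand, recall_sklearn)
```

If the cells had been read in a different order, the hand-computed values would no longer match the library's.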

  • 2021-02-01 04:39

    Supporting Answer:

    When drawing the confusion matrix values using sklearn.metrics, be aware that the order of the values is

    [ True Negative   False Positive ]
    [ False Negative  True Positive  ]

    If you misread the values, say TP for TN, your accuracy and AUC_ROC will more or less match, but your precision, recall (sensitivity), and F1 score will be off, and you will end up with completely different metrics. This will lead you to misjudge your model's performance.

    Do make sure to clearly identify what the 1 and 0 in your model represent. This heavily dictates the results of the confusion matrix.
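
To make the effect concrete, here is a small arithmetic sketch. The four counts are invented for illustration (they are not from my fraud project), and `metrics` is a helper defined only for this example:

```python
def metrics(tn, fp, fn, tp):
    """Compute accuracy, precision and recall from the four cell counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical flattened matrix for an imbalanced problem, in sklearn's
# order: tn, fp, fn, tp.
cells = [90, 5, 3, 2]

# Correct reading: tn, fp, fn, tp.
correct = metrics(tn=cells[0], fp=cells[1], fn=cells[2], tp=cells[3])

# Mistaken reading: tp, fp, fn, tn (TP and TN swapped).
wrong = metrics(tp=cells[0], fp=cells[1], fn=cells[2], tn=cells[3])

print(correct)
print(wrong)
```

Accuracy is identical in both readings because it only depends on TP + TN, while precision and recall change substantially.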

    Experience:

    I was working on predicting fraud (binary supervised classification), where fraud was denoted by 1 and non-fraud by 0. My model was trained on a scaled-up, perfectly balanced data set, so during in-time testing the confusion matrix values did not look suspicious when I read them in the order [TP FP] [FN TN].

    Later, when I had to perform an out-of-time test on a new, imbalanced test set, I realized that the above order was wrong and different from the one given on sklearn's documentation page, which states the order as tn, fp, fn, tp. Plugging in that order made me realize the blunder and how much it had distorted my judgement of the model's performance.

  • 2021-02-01 04:56

    Following the Wikipedia example: if a classification system has been trained to distinguish between cats and non-cats, a confusion matrix will summarize the results of testing the algorithm for further inspection. Assuming a sample of 27 animals (8 cats and 19 non-cats), the resulting confusion matrix could look like the table below:

    With sklearn

    If you want to keep the structure of the Wikipedia confusion matrix, pass the predicted values first and then the actual classes.

    from sklearn.metrics import confusion_matrix
    y_true = [0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,1,1,0,1,0,0,0,0]
    y_pred = [0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0]
    confusion_matrix(y_pred, y_true, labels=[1,0])
    
    Out[1]: 
    array([[ 5,  2],
           [ 3, 17]], dtype=int64)
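
An alternative sketch, in case you prefer to keep scikit-learn's usual argument order (actual values first): call the function the conventional way and transpose the result. The variable names are mine.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0]
y_pred = [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0]

# Conventional call: rows are actual classes, columns are predictions.
sklearn_layout = confusion_matrix(y_true, y_pred, labels=[1, 0])

# Transposing puts predictions in rows and actual classes in columns.
wikipedia_layout = sklearn_layout.T

print(sklearn_layout)
print(wikipedia_layout)
```

The transposed matrix should match the output of the swapped-argument call shown above.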
    

    Another way, with pandas crosstab:

    import numpy as np
    import pandas as pd

    # Map 1/0 to readable names and fix the category order (cat first).
    true = pd.Categorical(list(np.where(np.array(y_true) == 1, 'cat','non-cat')), categories = ['cat','non-cat'])
    pred = pd.Categorical(list(np.where(np.array(y_pred) == 1, 'cat','non-cat')), categories = ['cat','non-cat'])
    
    # Predictions in rows, actual classes in columns.
    pd.crosstab(pred, true, 
                rownames=['pred'], 
                colnames=['Actual'], margins=False, margins_name="Total")
    
    Out[2]: 
    Actual   cat  non-cat
    pred                 
    cat        5        2
    non-cat    3       17
    

    I hope this helps.
