sklearn precision_recall_curve and threshold

Question

I was wondering how sklearn decides how many thresholds to use in precision_recall_curve. There is another post on this, "How does sklearn select threshold steps in precision recall curve?", which points to the source code, where I found this example:

import numpy as np
from sklearn.metrics import precision_recall_curve
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

which then gives

>>> precision
array([0.66666667, 0.5       , 1.        , 1.        ])
>>> recall
array([1. , 0.5, 0.5, 0. ])
>>> thresholds
array([0.35, 0.4 , 0.8 ])

Could someone explain to me how to get those recalls and precisions by showing me what is computed?

Answer 1:

I know I am a bit late here, but I had a similar question, and the link you provided cleared it up. Roughly speaking, here is what happens inside precision_recall_curve(), following the sklearn implementation.

The decision scores are sorted in descending order, and the labels are reordered to match:

desc_score_indices = np.argsort(y_scores, kind="mergesort")[::-1]
y_scores = y_scores[desc_score_indices]
y_true = y_true[desc_score_indices]

You'll get:

y_scores, y_true
(array([0.8 , 0.4 , 0.35, 0.1 ]), array([1, 0, 1, 0]))

The sklearn implementation then keeps only one index per distinct value of y_scores (there are no duplicates in this example):

distinct_value_indices = np.where(np.diff(y_scores))[0]
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]

Because there are no duplicates, you'll get:

distinct_value_indices, threshold_idxs 
(array([0, 1, 2], dtype=int64), array([0, 1, 2, 3], dtype=int64))
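
Since the example has no ties, here is a minimal sketch of what the same two lines do when scores repeat. The tied scores below are made up for illustration and are not part of the question's data.

```python
import numpy as np

# Hypothetical scores, already sorted in descending order,
# with ties at 0.8 and at 0.4.
y_scores_tied = np.array([0.8, 0.8, 0.4, 0.4, 0.1])

# np.diff is non-zero only where consecutive scores differ, so np.where
# returns the index of the last element of each run of equal scores
# (except for the final run).
distinct_value_indices = np.where(np.diff(y_scores_tied))[0]

# Append the last index so the final run is represented too.
threshold_idxs = np.r_[distinct_value_indices, y_scores_tied.size - 1]

print(distinct_value_indices.tolist())         # [1, 3]
print(threshold_idxs.tolist())                 # [1, 3, 4]
print(y_scores_tied[threshold_idxs].tolist())  # [0.8, 0.4, 0.1]
```

Keeping the last index of each run means the cumulative sums taken at those indices include every sample that shares the same score.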

Next, compute the numbers of true positives and false positives, from which precision and recall follow.

# tps[i] is the number of positive samples with a score >= thresholds[i]
tps = np.cumsum(y_true)[threshold_idxs]
# fps[i] is the number of negative samples with a score >= thresholds[i]
# (sklearn computes it as fps = 1 + threshold_idxs - tps)
fps = np.cumsum(1 - y_true)[threshold_idxs]
y_scores = y_scores[threshold_idxs]

After these steps, you'll have two arrays with the numbers of true positives and false positives at each considered score:

tps, fps
(array([1, 1, 2, 2], dtype=int32), array([0, 1, 1, 2], dtype=int32))
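
As a sanity check, the same counts can be obtained directly from the definition, without sorting or cumulative sums. This is only a brute-force sketch for verification (it is not how sklearn computes them), and it uses the original unsorted data from the question. The names `candidate_thresholds`, `tps_check`, and `fps_check` are made up for this example.

```python
# Original (unsorted) data from the question.
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# Candidate thresholds: each distinct score, highest first.
candidate_thresholds = sorted(set(y_scores), reverse=True)

# For each threshold, count the positives and the negatives
# whose score is at or above it.
tps_check = [
    sum(1 for t, s in zip(y_true, y_scores) if s >= thr and t == 1)
    for thr in candidate_thresholds
]
fps_check = [
    sum(1 for t, s in zip(y_true, y_scores) if s >= thr and t == 0)
    for thr in candidate_thresholds
]

print(candidate_thresholds)  # [0.8, 0.4, 0.35, 0.1]
print(tps_check)             # [1, 1, 2, 2]
print(fps_check)             # [0, 1, 1, 2]
```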

Now you can compute precision and recall:
```
precision = tps / (tps + fps)
# tps[-1] being the total number of positive samples
recall = tps / tps[-1]

precision, recall
(array([1.        , 0.5       , 0.66666667, 0.5       ]), array([0.5, 0.5, 1. , 1. ]))
```
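
Written out as exact fractions, the arithmetic behind those decimals is easier to follow. Here is a small sketch using Python's standard `fractions` module, with the true-positive and false-positive counts copied from the arrays shown earlier; the names `precision_exact` and `recall_exact` are made up for this example.

```python
from fractions import Fraction

tps = [1, 1, 2, 2]   # true positives at thresholds 0.8, 0.4, 0.35, 0.1
fps = [0, 1, 1, 2]   # false positives at the same thresholds

# precision = TP / (TP + FP); recall = TP / (total number of positives)
precision_exact = [Fraction(tp, tp + fp) for tp, fp in zip(tps, fps)]
recall_exact = [Fraction(tp, tps[-1]) for tp in tps]

print([str(p) for p in precision_exact])  # ['1', '1/2', '2/3', '1/2']
print([str(r) for r in recall_exact])     # ['1/2', '1/2', '1', '1']
```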
An important point, raised in the post you linked, explains why the thresholds array is shorter than y_scores even though y_scores has no duplicates. The index of the first occurrence of recall equal to 1 determines the length of the thresholds array: that index is 2 here, so indices 0 through 2 are kept and thresholds has length 3.
```
last_ind = tps.searchsorted(tps[-1])   # 2
sl = slice(last_ind, None, -1)         # from index 2 to 0

precision, recall, thresholds = np.r_[precision[sl], 1], np.r_[recall[sl], 0], y_scores[sl]

(array([0.66666667, 0.5       , 1.        , 1.        ]),
array([1. , 0.5, 0.5, 0. ]), array([0.35, 0.4 , 0.8 ]))
```
One last point: precision and recall have length 4 because a final precision of 1 and a final recall of 0 are appended to the arrays, so that the precision-recall curve starts on the y-axis.
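
Putting the steps together, here is a minimal self-contained sketch of the walkthrough. The function name `pr_curve_sketch` is made up for this example; it mirrors the steps described in this answer rather than reproducing sklearn's source, it assumes binary 0/1 labels with at least one positive, and it skips input validation, sample weights, and `pos_label` handling.

```python
import numpy as np

def pr_curve_sketch(y_true, y_scores):
    """Hypothetical helper mirroring the steps described in this answer."""
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)

    # 1. Sort scores in descending order and reorder the labels to match.
    order = np.argsort(y_scores, kind="mergesort")[::-1]
    y_scores, y_true = y_scores[order], y_true[order]

    # 2. Keep one index per distinct score (the last of each run).
    distinct = np.where(np.diff(y_scores))[0]
    idxs = np.r_[distinct, y_true.size - 1]

    # 3. Cumulative true/false positive counts at each kept threshold.
    tps = np.cumsum(y_true)[idxs]
    fps = 1 + idxs - tps
    thresholds = y_scores[idxs]

    # 4. Precision and recall at each threshold.
    precision = tps / (tps + fps)
    recall = tps / tps[-1]

    # 5. Stop at the first threshold with full recall, reverse the order,
    #    and append the end point (precision=1, recall=0).
    last = tps.searchsorted(tps[-1])
    sl = slice(last, None, -1)
    return np.r_[precision[sl], 1], np.r_[recall[sl], 0], thresholds[sl]

precision, recall, thresholds = pr_curve_sketch([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(precision.round(3).tolist())  # [0.667, 0.5, 1.0, 1.0]
print(recall.tolist())              # [1.0, 0.5, 0.5, 0.0]
print(thresholds.tolist())          # [0.35, 0.4, 0.8]
```

Depending on the installed scikit-learn version, the library itself may return one extra leading point (threshold 0.1): as far as I know, newer releases no longer cut the curve at the first full-recall threshold, so the output of `precision_recall_curve` may not match the arrays in the question exactly.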

Source: https://stackoverflow.com/questions/60865028/sklearn-precision-recall-curve-and-threshold

Tags

scikit-learn

precision

precision-recall