Question
I was wondering how sklearn decides how many thresholds to use in precision_recall_curve. There is another post on this here: How does sklearn select threshold steps in precision recall curve?. It mentions the source code, where I found this example:
import numpy as np
from sklearn.metrics import precision_recall_curve
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
which then gives
>>> precision
array([0.66666667, 0.5       , 1.        , 1.        ])
>>> recall
array([1. , 0.5, 0.5, 0. ])
>>> thresholds
array([0.35, 0.4 , 0.8 ])
Could someone explain to me how to get those recalls and precisions by showing me what is computed?
Answer 1:
I know I am a bit late here, but I had a similar question, and the link you provided cleared it up. Roughly speaking, here is what happens inside precision_recall_curve(), following the sklearn implementation.
The decision scores are sorted in descending order, and the labels are reordered to match:
desc_score_indices = np.argsort(y_scores, kind="mergesort")[::-1]
y_scores = y_scores[desc_score_indices]
y_true = y_true[desc_score_indices]
You'll get:
>>> y_scores, y_true
(array([0.8 , 0.4 , 0.35, 0.1 ]), array([1, 0, 1, 0]))
The sklearn implementation then excludes duplicated values of y_scores (there are no duplicates in this example):
distinct_value_indices = np.where(np.diff(y_scores))[0]
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]
Because there are no duplicates, you'll get:
>>> distinct_value_indices, threshold_idxs
(array([0, 1, 2], dtype=int64), array([0, 1, 2, 3], dtype=int64))
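Since the example has no ties, the effect of this step is easy to miss. Below is a minimal sketch in plain Python (no NumPy) that mirrors the two expressions above on a hypothetical score list with a tie at 0.4. The scores are made up for illustration and assumed to be already sorted in descending order.

```python
# Hypothetical scores, already sorted in descending order, with a tie at 0.4.
y_scores = [0.8, 0.4, 0.4, 0.1]
n = len(y_scores)

# Positions where the score differs from the next one, i.e. the last index of
# each run of equal scores (the role of np.where(np.diff(y_scores))[0]).
distinct_value_indices = [i for i in range(n - 1) if y_scores[i] != y_scores[i + 1]]

# The final sample always closes the last run (the np.r_[..., size - 1] part).
threshold_idxs = distinct_value_indices + [n - 1]

print(distinct_value_indices)  # [0, 2]
print(threshold_idxs)          # [0, 2, 3]
```

The two samples scored 0.4 collapse into a single threshold at index 2, so the cumulative counts read at that index include both of them.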
Next, compute the number of true positives and false positives, from which precision and recall follow.
# tps at index i is the number of positive samples assigned a score >= thresholds[i]
tps = np.cumsum(y_true)[threshold_idxs]
# fps at index i is the number of negative samples assigned a score >= thresholds[i];
# sklearn computes it as fps = 1 + threshold_idxs - tps
fps = np.cumsum(1 - y_true)[threshold_idxs]
y_scores = y_scores[threshold_idxs]
After these steps you'll have two arrays holding the number of true positives and false positives for each score considered.
>>> tps, fps
(array([1, 1, 2, 2], dtype=int32), array([0, 1, 1, 2], dtype=int32))
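The comment above mentions the shortcut fps = 1 + threshold_idxs - tps. The following sketch, in plain Python and assuming every sample has weight 1, checks that the shortcut agrees with counting negatives directly. The names cum_pos, cum_neg, fps_direct and fps_identity are introduced here only for illustration.

```python
from itertools import accumulate

# Labels after sorting by descending score, as in the example above.
y_true = [1, 0, 1, 0]
threshold_idxs = [0, 1, 2, 3]

cum_pos = list(accumulate(y_true))                 # running count of positives
cum_neg = list(accumulate(1 - y for y in y_true))  # running count of negatives

tps = [cum_pos[i] for i in threshold_idxs]
fps_direct = [cum_neg[i] for i in threshold_idxs]

# Samples seen so far (index + 1) minus the positives among them.
fps_identity = [1 + i - tp for i, tp in zip(threshold_idxs, tps)]

print(tps)           # [1, 1, 2, 2]
print(fps_direct)    # [0, 1, 1, 2]
print(fps_identity)  # [0, 1, 1, 2]
```

The identity works because index i covers exactly i + 1 samples, each of which is either a positive or a negative.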
Finally, you can compute precision and recall.
precision = tps / (tps + fps)
# tps[-1] is the total number of positive samples
recall = tps / tps[-1]
>>> precision, recall
(array([1.        , 0.5       , 0.66666667, 0.5       ]), array([0.5, 0.5, 1. , 1. ]))
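To see what is actually being counted, the same numbers can be obtained straight from the definitions, without any cumulative sums. This is a brute-force sketch in plain Python, not the way sklearn does it: for each distinct score t, every sample with score >= t is treated as predicted positive.

```python
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

total_pos = sum(y_true)
rows = []

# Walk the distinct scores from highest to lowest, using each as a threshold.
for t in sorted(set(y_scores), reverse=True):
    predicted_pos = [y for y, s in zip(y_true, y_scores) if s >= t]
    tp = sum(predicted_pos)            # positives among the predicted positives
    fp = len(predicted_pos) - tp       # negatives among the predicted positives
    precision = tp / (tp + fp)
    recall = tp / total_pos
    rows.append((t, tp, fp, precision, recall))
    print(f"t={t}: tp={tp} fp={fp} precision={precision:.3f} recall={recall:.1f}")

# t=0.8: tp=1 fp=0 precision=1.000 recall=0.5
# t=0.4: tp=1 fp=1 precision=0.500 recall=0.5
# t=0.35: tp=2 fp=1 precision=0.667 recall=1.0
# t=0.1: tp=2 fp=2 precision=0.500 recall=1.0
```

The four rows match the tps, fps, precision and recall arrays above, in the same descending-score order.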
An important point, raised in the link you referenced, explains why the thresholds array is shorter than y_scores even though y_scores has no duplicates: the index of the first occurrence of recall equal to 1 defines the length of the thresholds array. Here that index is 2, so thresholds has length 3.
last_ind = tps.searchsorted(tps[-1])  # 2
sl = slice(last_ind, None, -1)  # from index 2 down to 0
precision, recall, thresholds = np.r_[precision[sl], 1], np.r_[recall[sl], 0], y_scores[sl]
This gives:
>>> precision, recall, thresholds
(array([0.66666667, 0.5       , 1.        , 1.        ]), array([1. , 0.5, 0.5, 0. ]), array([0.35, 0.4 , 0.8 ]))
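In the example only one threshold is dropped, so here is a small sketch with hypothetical numbers where the cut is more visible. It uses bisect_left from the standard library, which on a sorted list finds the same leftmost insertion point that searchsorted returns by default. The tps and scores lists are invented for illustration.

```python
from bisect import bisect_left

# Hypothetical cumulative true-positive counts for five thresholds, ordered
# from the highest score to the lowest; both positives are found by index 1.
tps = [1, 2, 2, 2, 2]
scores = [0.9, 0.7, 0.6, 0.3, 0.2]

# First index at which tps reaches its final value, i.e. recall first hits 1.
last_ind = bisect_left(tps, tps[-1])

# Keep indices last_ind, ..., 0 (reversed) and drop everything after last_ind.
thresholds = scores[last_ind::-1]

print(last_ind)    # 1
print(thresholds)  # [0.7, 0.9]
```

Lowering the threshold below 0.7 cannot raise recall any further, which is why those three scores are discarded in this scheme.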
One last point: precision and recall have length 4 because a precision of 1 and a recall of 0 are appended to the arrays, so that the precision-recall curve starts on the y-axis.
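Putting the steps together, here is a rough end-to-end sketch in plain Python. It follows the procedure as described in this answer, not the sklearn source itself: pr_curve_sketch is a hypothetical name, and the function assumes binary labels, unit sample weights and at least one positive sample. It is not checked against any particular scikit-learn release.

```python
from bisect import bisect_left
from itertools import accumulate


def pr_curve_sketch(y_true, y_scores):
    """Follow the steps of the answer in plain Python (unweighted, binary labels)."""
    # 1. Sort by descending score and reorder the labels to match.
    order = sorted(range(len(y_scores)), key=lambda i: y_scores[i], reverse=True)
    scores = [y_scores[i] for i in order]
    labels = [y_true[i] for i in order]

    # 2. Keep the last index of each run of equal scores.
    n = len(scores)
    idxs = [i for i in range(n - 1) if scores[i] != scores[i + 1]] + [n - 1]

    # 3. Cumulative true and false positives at those indices.
    cum_pos = list(accumulate(labels))
    tps = [cum_pos[i] for i in idxs]
    fps = [1 + i - tp for i, tp in zip(idxs, tps)]

    # 4. Precision and recall per threshold.
    precision = [tp / (tp + fp) for tp, fp in zip(tps, fps)]
    recall = [tp / tps[-1] for tp in tps]

    # 5. Cut at the first full recall, reverse, and append the point
    #    (recall=0, precision=1).
    last_ind = bisect_left(tps, tps[-1])
    kept_scores = [scores[i] for i in idxs]
    return (
        precision[last_ind::-1] + [1.0],
        recall[last_ind::-1] + [0.0],
        kept_scores[last_ind::-1],
    )


precision, recall, thresholds = pr_curve_sketch([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(precision)   # [0.6666666666666666, 0.5, 1.0, 1.0]
print(recall)      # [1.0, 0.5, 0.5, 0.0]
print(thresholds)  # [0.35, 0.4, 0.8]
```

The output matches the arrays in the question, with thresholds one element shorter than precision and recall because of the appended final point.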
Source: https://stackoverflow.com/questions/60865028/sklearn-precision-recall-curve-and-threshold