I am aware of the concept of Precision as well as the concept of Recall, but I am finding it very hard to understand the idea of a 'threshold' which makes any P-R curve possible.
First of all, you should remove the 'roc' and 'auc' tags, as a Precision-Recall curve is something different:
ROC Curves:
- x-axis: False Positive Rate FPR = FP /(FP + TN) = FP / N
- y-axis: True Positive Rate TPR = Recall = TP /(TP + FN) = TP / P
Precision-Recall Curves:
- x-axis: Recall = TP / (TP + FN) = TP / P = TPR
- y-axis: Precision = TP / (TP + FP) = TP / PP (predicted positives)
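
If it helps to see these formulas as code, here is a minimal sketch; the confusion-matrix counts are made up purely for illustration:

```python
# Hypothetical confusion-matrix counts, for illustration only.
TP, FP, TN, FN = 80, 10, 90, 20

P = TP + FN    # actual positives
N = FP + TN    # actual negatives
PP = TP + FP   # predicted positives

fpr = FP / N          # x-axis of the ROC curve
tpr = TP / P          # y-axis of the ROC curve (= Recall)
recall = tpr          # x-axis of the PR curve
precision = TP / PP   # y-axis of the PR curve

print(f"FPR={fpr:.2f}, TPR/Recall={recall:.2f}, Precision={precision:.2f}")
```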
Your cancer detection example is a binary classification problem. Your predictions are based on a probability: the probability of having (or not having) cancer.
In general, an instance is classified as A if P(A) > 0.5 (your threshold value). For this threshold, you get one Precision-Recall pair based on the True Positives, True Negatives, False Positives and False Negatives.
Now, as you change your 0.5 threshold, you get a different result (a different pair). For example, you could already classify a patient as 'has cancer' for P(A) > 0.3. This will decrease Precision and increase Recall: you would rather tell someone they have cancer even though they do not, to make sure that patients who do have cancer get the treatment they need. This is the intuitive trade-off between TPR and FPR, Precision and Recall, or Sensitivity and Specificity.
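To make the effect of moving the threshold concrete, here is a small sketch using scikit-learn; the labels and predicted probabilities are invented just for illustration:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels (1 = has cancer) and predicted probabilities P(A).
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 1, 0, 1])
p_cancer = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.45, 0.9, 0.05, 0.7])

for threshold in (0.5, 0.3):
    # Classify as 'has cancer' if P(A) > threshold.
    y_pred = (p_cancer > threshold).astype(int)
    print(threshold,
          precision_score(y_true, y_pred),
          recall_score(y_true, y_pred))
```

With these numbers, lowering the threshold from 0.5 to 0.3 raises Recall from 0.8 to 1.0 while Precision drops from 1.0 to about 0.71.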
Let's add these terms, as you will see them more often in biostatistics.
- Sensitivity = TP / P = Recall = TPR
- Specificity = TN / N = (1 – FPR)
ROC curves and Precision-Recall curves visualize your classifier's performance over all these possible thresholds: each point on a curve corresponds to one threshold.
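In scikit-learn, `roc_curve` and `precision_recall_curve` perform exactly this threshold sweep for you (again using the made-up data from the previous sketch):

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve

# Same hypothetical labels and probabilities as above.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 1, 0, 1])
p_cancer = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.45, 0.9, 0.05, 0.7])

# Every distinct predicted probability is tried as a threshold;
# each entry in the returned arrays is one point on the curve.
fpr, tpr, roc_thresholds = roc_curve(y_true, p_cancer)
precision, recall, pr_thresholds = precision_recall_curve(y_true, p_cancer)

# Plotting (fpr, tpr) gives the ROC curve,
# plotting (recall, precision) gives the Precision-Recall curve.
```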
You should consider these metrics when accuracy alone is not a suitable quality measure. Classifying all patients as 'does not have cancer' will give you the highest accuracy on such imbalanced data, but the corresponding values on your ROC and Precision-Recall curves will be trivial 1s and 0s (for example, Recall = 0).
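A minimal sketch of that accuracy trap, assuming a hypothetical imbalanced dataset with 5% cancer cases:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced data: 95 healthy patients, 5 with cancer.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)          # classify everyone as 'does not have cancer'

print(accuracy_score(y_true, y_pred))   # 0.95 -- looks great
print(recall_score(y_true, y_pred))     # 0.0  -- every cancer case is missed
```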