Question
I am working with cross-validation data (10-fold, repeated 5 times) from an SVM-RFE model built with the caret package. I know that caret works with the pROC package when computing metrics, but I need to use the ROCR package to obtain the average ROC curve. However, I noticed that the average AUC values are not the same with each package, so I am not sure whether the two packages can be used interchangeably.
The code I used to check this is:
library(pROC)
library(ROCR)

predictions_NG3 <- list()
labels_NG3 <- list()

# Keep the resampled predictions at the optimal subset size,
# then split them into the 50 individual folds (10-fold x 5 repeats)
optSize <- svmRFE_NG3$optsize
resamples <- split(svmRFE_NG3$pred, svmRFE_NG3$pred$Variables)
resamplesFOLD <- split(resamples[[optSize]], resamples[[optSize]]$Resample)

auc_pROC <- vector()
auc_ROCR <- vector()
for (i in 1:50) {
  predictions_NG3[[i]] <- resamplesFOLD[[i]]$LUNG
  labels_NG3[[i]] <- resamplesFOLD[[i]]$obs

  # With pROC
  rocCurve <- roc(response = labels_NG3[[i]],
                  predictor = predictions_NG3[[i]],
                  levels = c("BREAST", "LUNG"))  # LUNG is the positive class
  auc_pROC <- c(auc_pROC, auc(rocCurve))

  # With ROCR
  pred_ROCR <- prediction(predictions_NG3[[i]], labels_NG3[[i]],
                          label.ordering = c("BREAST", "LUNG"))  # LUNG is the positive class
  auc_ROCR <- c(auc_ROCR, performance(pred_ROCR, "auc")@y.values[[1]])
}

auc_mean_pROC <- mean(auc_pROC)
auc_sd_pROC <- sd(auc_pROC)
auc_mean_ROCR <- mean(auc_ROCR)
auc_sd_ROCR <- sd(auc_ROCR)
The results are slightly different:
auc_mean_pROC auc_sd_pROC auc_mean_ROCR auc_sd_ROCR
1 0.8755556 0.1524801 0.8488889 0.2072751
The individual AUC values also differ in several folds, for example at indices [5], [22] and [25]:
> auc_pROC
[1] 0.8333333 0.8333333 1.0000000 1.0000000 0.6666667 0.8333333 0.3333333 0.8333333 1.0000000 1.0000000 1.0000000 1.0000000
[13] 0.8333333 0.5000000 0.8888889 1.0000000 1.0000000 1.0000000 0.8333333 0.8333333 0.8333333 0.6666667 0.6666667 0.8888889
[25] 0.8333333 0.6666667 1.0000000 0.6666667 1.0000000 0.6666667 1.0000000 1.0000000 0.8333333 0.8333333 0.8333333 1.0000000
[37] 0.8333333 1.0000000 0.8333333 1.0000000 0.8333333 1.0000000 1.0000000 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000
[49] 1.0000000 1.0000000
> auc_ROCR
[1] 0.8333333 0.8333333 1.0000000 1.0000000 0.3333333 0.8333333 0.3333333 0.8333333 1.0000000 1.0000000 1.0000000 1.0000000
[13] 0.8333333 0.5000000 0.8888889 1.0000000 1.0000000 1.0000000 0.8333333 0.8333333 0.8333333 0.3333333 0.6666667 0.8888889
[25] 0.1666667 0.6666667 1.0000000 0.6666667 1.0000000 0.6666667 1.0000000 1.0000000 0.8333333 0.8333333 0.8333333 1.0000000
[37] 0.8333333 1.0000000 0.8333333 1.0000000 0.8333333 1.0000000 1.0000000 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000
[49] 1.0000000 1.0000000
I have tried other SVM-RFE models, but the problem persists. Why is this happening? Am I doing something wrong?
Answer 1:
By default, the roc function in pROC attempts to detect which response level corresponds to the controls and which to the cases (you overrode that default by setting the levels argument), and whether the controls should have higher or lower predictor values than the cases. You haven't used the direction argument to set the latter.
When you resample your data, this auto-detection happens for every sample. If your sample size is small, or your AUC is close to 0.5, it can and will happen that some ROC curves are built with the opposite direction, biasing your average AUC towards higher values.
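As a minimal sketch of the effect (with made-up labels and scores, not data from the question): pROC's auto-detection compares the median score of the controls and the cases and picks the direction accordingly, so a below-chance fold gets its curve flipped, while ROCR always counts higher predictions towards the second level of label.ordering:
library(pROC)
library(ROCR)

# Hypothetical fold where the predictor is worse than chance
labels <- factor(c("BREAST", "BREAST", "LUNG", "LUNG"),
                 levels = c("BREAST", "LUNG"))
scores <- c(0.9, 0.2, 0.1, 0.8)

# pROC, auto-detected direction: the curve is flipped, AUC = 0.75
auc(roc(response = labels, predictor = scores,
        levels = c("BREAST", "LUNG")))

# pROC, fixed direction (controls < cases): AUC = 0.25
auc(roc(response = labels, predictor = scores,
        levels = c("BREAST", "LUNG"), direction = "<"))

# ROCR never flips the curve: AUC = 0.25, matching the fixed direction
pred <- prediction(scores, labels, label.ordering = c("BREAST", "LUNG"))
performance(pred, "auc")@y.values[[1]]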
Therefore you should always set the direction argument explicitly when you resample ROC curves or similar, for instance:
rocCurve <- roc(response = labels_NG3[[i]],
                predictor = predictions_NG3[[i]],
                direction = "<",
                levels = c("BREAST","LUNG"))
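With direction = "<" (controls have lower predictor values than cases), the per-fold AUCs from pROC should match ROCR's exactly: the discrepant folds such as [5], [22] and [25] will then report the same below-chance values ROCR already gives (note that each discrepant pair is complementary, e.g. 0.6666667 vs 0.3333333), and the two averages will agree.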
Source: https://stackoverflow.com/questions/37252317/difference-in-average-auc-computation-using-rocr-and-proc-r