问题
I would like to know if there is a way to plot the average ROC Curve from the cross-validation data of a SVM-RFE model generated with the caret
package.
My results are:
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance over subset size:
Variables ROC Sens Spec Accuracy Kappa ROCSD SensSD SpecSD AccuracySD KappaSD Selected
1 0.6911 0.0000 1.0000 0.5900 0.0000 0.2186 0.0000 0.0000 0.0303 0.0000
2 0.7600 0.3700 0.8067 0.6280 0.1807 0.1883 0.3182 0.2139 0.1464 0.3295
3 0.7267 0.4233 0.8667 0.6873 0.3012 0.2020 0.3216 0.1905 0.1516 0.3447
4 0.6989 0.3867 0.8600 0.6680 0.2551 0.2130 0.3184 0.1793 0.1458 0.3336
5 0.7000 0.3367 0.8600 0.6473 0.2006 0.2073 0.3359 0.1793 0.1588 0.3672
6 0.7167 0.3833 0.8200 0.6427 0.2105 0.1909 0.3338 0.2539 0.1682 0.3639
7 0.7122 0.3767 0.8333 0.6487 0.2169 0.1784 0.3226 0.2048 0.1642 0.3702
8 0.7144 0.4233 0.7933 0.6440 0.2218 0.2017 0.3454 0.2599 0.1766 0.3770
9 0.8356 0.6533 0.7867 0.7300 0.4363 0.1706 0.3415 0.2498 0.1997 0.4209
10 0.8811 0.6867 0.8200 0.7647 0.5065 0.1650 0.3134 0.2152 0.1949 0.4053 *
11 0.8700 0.6933 0.8133 0.7627 0.5046 0.1697 0.3183 0.2147 0.1971 0.4091
12 0.8678 0.6967 0.7733 0.7407 0.4682 0.1579 0.3153 0.2559
...
The top 5 variables (out of 10):
SumAverage_GLCM_R1SC4NG2, Variance_GLCM_R1SC4NG2, HGZE_GLSZM_R1SC4NG2, LGZE_GLSZM_R1SC4NG2, SZLGE_GLSZM_R1SC4NG2
I have tried with the solution mentioned here: ROC curve from training data in caret
optSize <- svmRFE_NG2$optsize
selectedIndices <- svmRFE_NG2$pred$Variables == optSize
plot.roc(svmRFE_NG2$pred$obs[selectedIndices],
svmRFE_NG2$pred$LUNG[selectedIndices])
But this solution seems not to work (the resulting AUC value is quite different). I have separated the results of the training process into the 50 cross-validation sets, as mentioned in the previous answer, but I do not know what to do next.
resamples<-split(svmRFE_NG2$pred,svmRFE_NG2$pred$Variables)
resamplesFOLD<-split(resamples[[optSize]],resamples[[optSize]]$Resample)
Any ideas?
回答1:
As you already did you can a) enable savePredictions = T
in the trainControl
parameter of caret::train
, then, b) from the trained model object, use the pred
variable - which contains all predictions over all partitions and resamples - to compute whichever ROC curve you would like to look at. You now have multiple options of which ROC this can be, e.g.:
you could look at all predictions over all partitions and resamples at once:
plot(roc(predictor = modelObject$pred$CLASSNAME, response = modelObject$pred$obs))
Or you could do this over individual partitions and/or resamples (which is what you tried above). The following example computes the ROC curve per partition and resample, so with 10 partitions and 5 repeats will result in 50 ROC curves:
library(plyr)
l_ply(split(modelObject$pred, modelObject$pred$Resample), function(d) {
plot(roc(predictor = d$CLASSNAME, response = d$obs))
})
Depending on your data and model, the latter will give you certain variance in the resulting ROC curves and AUC values. You can see the same variance in the AUC
and SD
values caret
calculated for your individual partitions and resamples, so this results from your data and model and is correct.
BTW: I was using the pROC::roc
function for calculating the examples above, but you could use any suitable function here. And, when using caret::train
obtaining the ROC is always the same, no matter the model type.
回答2:
I know this post is old but I have the same issue trying to understand why I get different results when calculating the ROC value from each resample and when I am calculating the ROC value using all predictions and resamples at once. Which method for calculating ROC is correct?
(Apologies for posting this as a new answer but I am not allowed to post a comment.)
来源:https://stackoverflow.com/questions/37215366/plot-roc-curve-from-cross-validation-training-data-in-r