I have been trying to apply recursive feature selection using caret package. What I need is that ref uses the AUC as performance measure. After googling for a month I cannot get the process working. Here is the code I have used:
library(caret)
library(doMC)
registerDoMC(cores = 4)
data(mdrr)
subsets <- c(1:10)
ctrl <- rfeControl(functions=caretFuncs,
method = "cv",
repeats =5, number = 10,
returnResamp="final", verbose = TRUE)
trainctrl <- trainControl(classProbs= TRUE)
caretFuncs$summary <- twoClassSummary
set.seed(326)
rf.profileROC.Radial <- rfe(mdrrDescr, mdrrClass, sizes=subsets,
rfeControl=ctrl,
method="svmRadial",
metric="ROC",
trControl=trainctrl)
When executing this script I get the following results:
Recursive feature selection
Outer resampling method: Cross-Validation (10 fold)
Resampling performance over subset size:
Variables Accuracy Kappa AccuracySD KappaSD Selected
1 0.7501 0.4796 0.04324 0.09491
2 0.7671 0.5168 0.05274 0.11037
3 0.7671 0.5167 0.04294 0.09043
4 0.7728 0.5289 0.04439 0.09290
5 0.8012 0.5856 0.04144 0.08798
6 0.8049 0.5926 0.02871 0.06133
7 0.8049 0.5925 0.03458 0.07450
8 0.8124 0.6090 0.03444 0.07361
9 0.8181 0.6204 0.03135 0.06758 *
10 0.8069 0.5971 0.04234 0.09166
342 0.8106 0.6042 0.04701 0.10326
The top 5 variables (out of 9):
nC, X3v, Sp, X2v, X1v
The process always uses Accuracy as performance mesure. Another problem that arises is that when I try to get prediction from the model obtained using:
predictions <- predict(rf.profileROC.Radial$fit,mdrrDescr)
I get the following message
In predictionFunction(method, modelFit, tempX, custom = models[[i]]$control$custom$prediction) :
kernlab class prediction calculations failed; returning NAs
turning out to be imposible to get some prediction from the model.
Here is the information obtained through sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C LC_TIME=es_ES.UTF-8
[4] LC_COLLATE=es_ES.UTF-8 LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8
[7] LC_PAPER=es_ES.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid parallel splines stats graphics grDevices utils datasets methods base
other attached packages:
[1] e1071_1.6-2 class_7.3-9 pROC_1.6.0.1 doMC_1.3.2 iterators_1.0.6 foreach_1.4.1
[7] caret_6.0-21 ggplot2_0.9.3.1 lattice_0.20-24 kernlab_0.9-19
loaded via a namespace (and not attached):
[1] car_2.0-19 codetools_0.2-8 colorspace_1.2-4 compiler_3.0.2 dichromat_2.0-0
[6] digest_0.6.4 gtable_0.1.2 labeling_0.2 MASS_7.3-29 munsell_0.4.2
[11] nnet_7.3-7 plyr_1.8 proto_0.3-10 RColorBrewer_1.0-5 Rcpp_0.10.6
[16] reshape2_1.2.2 scales_0.2.3 stringr_0.6.2 tools_3.0.2
One problem is a minor typo ('trControl='
instead of 'trainControl='
). Also, you change caretFuncs
after you attached it to rfe
's control function. Lastly, you will need to tell trainControl
to calculate the ROC curves.
This code works:
caretFuncs$summary <- twoClassSummary
ctrl <- rfeControl(functions=caretFuncs,
method = "cv",
repeats =5, number = 10,
returnResamp="final", verbose = TRUE)
trainctrl <- trainControl(classProbs= TRUE,
summaryFunction = twoClassSummary)
rf.profileROC.Radial <- rfe(mdrrDescr, mdrrClass,
sizes=subsets,
rfeControl=ctrl,
method="svmRadial",
## I also added this line to
## avoid a warning:
metric = "ROC",
trControl = trainctrl)
> rf.profileROC.Radial
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold)
Resampling performance over subset size:
Variables ROC Sens Spec ROCSD SensSD SpecSD Selected
1 0.7805 0.8356 0.6304 0.08139 0.10347 0.10093
2 0.8340 0.8491 0.6609 0.06955 0.10564 0.09787
3 0.8412 0.8491 0.6565 0.07222 0.10564 0.09039
4 0.8465 0.8491 0.6609 0.06581 0.09584 0.10207
5 0.8502 0.8624 0.6652 0.05844 0.08536 0.09404
6 0.8684 0.8923 0.7043 0.06222 0.06893 0.09999
7 0.8642 0.8691 0.6913 0.05655 0.10837 0.06626
8 0.8697 0.8823 0.7043 0.05411 0.08276 0.07333
9 0.8792 0.8753 0.7348 0.05414 0.08933 0.07232 *
10 0.8622 0.8826 0.6696 0.07457 0.08810 0.16550
342 0.8650 0.8926 0.6870 0.07392 0.08140 0.17367
The top 5 variables (out of 9):
nC, X3v, Sp, X2v, X1v
For the prediction problems, you should use rf.profileROC.Radial
instead of the fit
component:
> predict(rf.profileROC.Radial, head(mdrrDescr))
pred Active Inactive
1 Inactive 0.4392768 0.5607232
2 Active 0.6553482 0.3446518
3 Active 0.6387261 0.3612739
4 Inactive 0.3060582 0.6939418
5 Active 0.6661557 0.3338443
6 Active 0.7513180 0.2486820
Max
来源:https://stackoverflow.com/questions/21088825/feature-selection-in-caret-rfe-sum-with-roc