Question
I would like to calculate the 10-fold cross-validated AUC of an elastic net regression model with the optimal alpha and lambda using caret::train.
https://stats.stackexchange.com/questions/69638/does-caret-train-function-for-glmnet-cross-validate-for-both-alpha-and-lambda/69651 explains how to cross-validate alpha and lambda with caret::train.
My question on Cross Validated was closed because it was classified as a programming question: https://stats.stackexchange.com/questions/505865/r-calculate-the-10-fold-crossvalidated-auc-with-glmnet-and-given-alpha-and-lamb?noredirect=1#comment934491_505865
What I have
Dataset:
library(tidyverse)
library(caret)
library(glmnet)
library(mlbench)
# example data
data(PimaIndiansDiabetes, package="mlbench")
# make a training set
set.seed(2323)
train.data <- PimaIndiansDiabetes
My model:
# build a model using the training set
set.seed(2323)
model <- train(
diabetes ~., data = train.data, method = "glmnet",
trControl = trainControl("cv",
number = 10,
classProbs = TRUE,
savePredictions = TRUE),
tuneLength = 10,
metric="ROC"
)
Here I get the following warning:
Warning message:
In train.default(x, y, weights = w, ...) :
The metric "ROC" was not in the result set. Accuracy will be used instead.
If I ignore the warning, the best alpha and lambda would be:
model$bestTune
alpha lambda
11 0.2 0.002926378
Now I would like to get a 10-fold cross-validated AUC using my model with the best alpha and lambda and the train data.
What I tried
My approach would be something like this; however, I get the error "Something is wrong; all the Accuracy metric values are missing":
model <- train(
diabetes ~., data = train.data, method = "glmnet",
trControl = trainControl("cv",
number = 10,
classProbs = TRUE,
savePredictions = TRUE),
alpha=model$bestTune$alpha,
lambda=model$bestTune$lambda,
tuneLength = 10,
metric="ROC"
)
How could I calculate a cross-validated AUC using the optimal alpha and lambda and the train data?
I am still not sure how to cross-validate for AUC, not Accuracy.
Thank you for your help.
Answer 1:
You intend to use "ROC" (area under the ROC curve) to pick the best tuning parameters, but you do not specify summaryFunction = twoClassSummary, which computes this metric. This is what the warning is telling you:
Warning message:
In train.default(x, y, weights = w, ...) :
The metric "ROC" was not in the result set. Accuracy will be used instead.
Perform tuning:
library(tidyverse)
library(caret)
library(glmnet)
library(mlbench)
data(PimaIndiansDiabetes, package="mlbench")
set.seed(2323)
train.data <- PimaIndiansDiabetes
set.seed(2323)
model <- train(
diabetes ~., data = train.data, method = "glmnet",
trControl = trainControl("cv",
number = 10,
classProbs = TRUE,
savePredictions = TRUE,
summaryFunction = twoClassSummary),
tuneLength = 10,
metric="ROC" #ROC metric is in twoClassSummary
)
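With this setup, the per-fold AUC for the selected alpha and lambda is already stored in the fitted object. A quick way to look at it, using standard caret accessors (not shown in the original answer):
# per-fold ROC (AUC), sensitivity and specificity for model$bestTune
model$resample

# mean resampled performance for the selected tuning parameters
getTrainPerf(model)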
Since you specified classProbs = TRUE and savePredictions = TRUE, you can also calculate any metric yourself from the saved out-of-fold predictions.
To calculate accuracy:
model$pred %>%
filter(alpha == model$bestTune$alpha, #filter predictions for best tuning parameters
lambda == model$bestTune$lambda) %>%
group_by(Resample) %>% #group by fold
summarise(acc = sum(pred == obs)/n()) #calculate metric
#output
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 10 x 2
Resample acc
<chr> <dbl>
1 Fold01 0.740
2 Fold02 0.753
3 Fold03 0.818
4 Fold04 0.776
5 Fold05 0.779
6 Fold06 0.753
7 Fold07 0.766
8 Fold08 0.792
9 Fold09 0.727
10 Fold10 0.789
This gives you the per-fold metric. To get the average performance:
model$pred %>%
filter(alpha == model$bestTune$alpha,
lambda == model$bestTune$lambda) %>%
group_by(Resample) %>%
summarise(acc = sum(pred == obs)/n()) %>%
pull(acc) %>%
mean
#output
0.769566
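The same pattern gives the cross-validated AUC the question asks about. A sketch, assuming the pROC package is installed (it is not used in the original answer); the pos column of model$pred holds the predicted probability of the "pos" class:
library(pROC)

model$pred %>%
  filter(alpha == model$bestTune$alpha,      #best tuning parameters only
         lambda == model$bestTune$lambda) %>%
  group_by(Resample) %>%                     #group by fold
  summarise(auc = as.numeric(roc(obs, pos,   #AUC per fold via pROC
                                 levels = c("neg", "pos"),
                                 direction = "<")$auc)) %>%
  pull(auc) %>%
  mean                                       #average over the 10 folds
These values should closely match the ROC column of model$resample for the best tuning parameters.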
When ROC is used as the selection metric, the hyperparameters are optimized over all decision thresholds. In many cases the chosen model will perform suboptimally with the default decision threshold of 0.5.
caret has a function, thresholder(), which calculates many metrics from the resampled data over the specified decision thresholds.
thresholder(model, seq(0, 1, length.out = 10)) #in reality I would use length.out = 100
#output
alpha lambda prob_threshold Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy Accuracy
1 0.1 0.03607775 0.0000000 1.000 0.00000000 0.6510595 NaN 0.6510595 1.000 0.7886514 0.6510595 0.6510595 1.0000000 0.5000000 0.6510595
2 0.1 0.03607775 0.1111111 0.994 0.02621083 0.6557464 0.7380952 0.6557464 0.994 0.7901580 0.6510595 0.6471463 0.9869617 0.5101054 0.6562714
3 0.1 0.03607775 0.2222222 0.986 0.15270655 0.6850874 0.8711111 0.6850874 0.986 0.8082906 0.6510595 0.6419344 0.9375256 0.5693533 0.6952837
4 0.1 0.03607775 0.3333333 0.964 0.32421652 0.7278778 0.8406807 0.7278778 0.964 0.8290127 0.6510595 0.6276316 0.8633459 0.6441083 0.7408578
5 0.1 0.03607775 0.4444444 0.928 0.47364672 0.7674158 0.7903159 0.7674158 0.928 0.8395895 0.6510595 0.6041866 0.7877990 0.7008234 0.7695147
6 0.1 0.03607775 0.5555556 0.862 0.59002849 0.7970454 0.7053968 0.7970454 0.862 0.8274687 0.6510595 0.5611928 0.7043575 0.7260142 0.7669686
7 0.1 0.03607775 0.6666667 0.742 0.75740741 0.8521972 0.6114289 0.8521972 0.742 0.7926993 0.6510595 0.4830827 0.5677204 0.7497037 0.7473855
8 0.1 0.03607775 0.7777778 0.536 0.90284900 0.9156149 0.5113452 0.9156149 0.536 0.6739140 0.6510595 0.3489918 0.3828606 0.7194245 0.6640636
9 0.1 0.03607775 0.8888889 0.198 0.98119658 0.9573810 0.3967404 0.9573810 0.198 0.3231917 0.6510595 0.1289474 0.1354751 0.5895983 0.4713602
10 0.1 0.03607775 1.0000000 0.000 1.00000000 NaN 0.3489405 NaN 0.000 NaN 0.6510595 0.0000000 0.0000000 0.5000000 0.3489405
Kappa J Dist
1 0.0000000 0.00000000 1.0000000
2 0.0258717 0.02021083 0.9738516
3 0.1699809 0.13870655 0.8475624
4 0.3337322 0.28821652 0.6774055
5 0.4417759 0.40164672 0.5329805
6 0.4692998 0.45202849 0.4363768
7 0.4727251 0.49940741 0.3580090
8 0.3726156 0.43884900 0.4785352
9 0.1342372 0.17919658 0.8026597
10 0.0000000 0.00000000 1.0000000
Now pick a threshold based on your desired metric and use that, as in the short sketch below. The metrics usually used with imbalanced data are Cohen's Kappa, Youden's J, or the Matthews correlation coefficient (MCC). Here is a decent paper on the matter.
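For example, a small sketch of selecting the threshold that maximises Youden's J from the thresholder() output (column names as in the output above):
th <- thresholder(model, seq(0, 1, length.out = 100))
best_threshold <- th$prob_threshold[which.max(th$J)]   #swap th$J for th$Kappa if preferred
best_threshold
As the Prevalence column above (~0.65, the share of "neg") shows, thresholder() treats the first factor level as the event of interest, so the cutoff applies to the predicted probability of "neg".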
Please note that since this data was used to find the optimal threshold, the performance obtained this way will be optimistically biased. To evaluate the performance of the picked decision threshold, it would be best to use several independent test sets. In other words, I recommend nested resampling, where you optimize the parameters and the threshold on the inner folds and evaluate on the outer folds.
Here is an explanation of how to use nested resampling with caret for regression. Some modifications are needed to make it work for classification with an optimized threshold; a rough sketch follows.
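A rough sketch of what that could look like here, assuming 5 outer folds, Youden's J to pick the threshold, and balanced accuracy as the outer-fold metric (these choices are made for illustration and are not part of the original answer):
set.seed(2323)
outer_folds <- createFolds(PimaIndiansDiabetes$diabetes, k = 5)

outer_perf <- sapply(outer_folds, function(test_idx) {
  inner_data <- PimaIndiansDiabetes[-test_idx, ]
  outer_data <- PimaIndiansDiabetes[test_idx, ]

  # inner 10-fold CV: tune alpha and lambda on ROC
  inner_model <- train(
    diabetes ~ ., data = inner_data, method = "glmnet",
    trControl = trainControl("cv", number = 10,
                             classProbs = TRUE,
                             savePredictions = TRUE,
                             summaryFunction = twoClassSummary),
    tuneLength = 10,
    metric = "ROC"
  )

  # pick the threshold that maximises Youden's J on the inner resamples
  th <- thresholder(inner_model, seq(0.05, 0.95, by = 0.01))
  best_th <- th$prob_threshold[which.max(th$J)]

  # apply it to the untouched outer fold; thresholder() cuts on the
  # probability of the first class level ("neg")
  p_neg <- predict(inner_model, outer_data, type = "prob")[, "neg"]
  pred  <- factor(ifelse(p_neg >= best_th, "neg", "pos"),
                  levels = levels(outer_data$diabetes))

  # evaluate the outer fold, e.g. with balanced accuracy
  confusionMatrix(pred, outer_data$diabetes)$byClass["Balanced Accuracy"]
})

mean(outer_perf)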
Please note that this is not the only way to pick the best decision threshold. Another way is to pick the desired metric a priori (MCC, for instance) and treat the decision threshold as a hyperparameter to be tuned jointly with all the other hyperparameters. I believe this is not supported in caret without creating a custom model.
Source: https://stackoverflow.com/questions/65814703/r-can-carettrain-function-for-glmnet-cross-validate-auc-at-fixed-alpha-and-la