Question
I'm wondering if there is a way to specify which class of the outcome variable is positive in caret's train() function. A minimal example:
# Packages
library(caret)
library(dplyr)
# Settings
ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE, summaryFunction = twoClassSummary, classProbs = TRUE)
# Data
data <- mtcars %>% mutate(am = factor(am, levels = c(0,1), labels = c("automatic", "manual"), ordered = T))
# Train
set.seed(123)
model1 <- train(am ~ disp + wt, data = data, method = "glm", family = "binomial", trControl = ctrl, tuneLength = 5)
# Data (factor ordering switched)
data <- mtcars %>% mutate(am = factor(am, levels = c(1,0), labels = c("manual", "automatic"), ordered = T))
# Train
set.seed(123)
model2 <- train(am ~ disp + wt, data = data, method = "glm", family = "binomial", trControl = ctrl, tuneLength = 5)
# Specificity and Sensitivity are switched
model1
model2
If you run the code, you'll notice that the Specificity and Sensitivity metrics are swapped between the two models. It looks like the train() function takes the first level of a factor outcome variable as the positive class. Is there a way to specify the positive class in the function itself, so that I get the same results regardless of the ordering of the outcome factor's levels? I tried adding positive = "manual", but this results in an error.
Answer 1:
I believe @Johannes's answer is an example of over-engineering a simple process.
Simply reverse the order of your factor levels:
df$target <- factor(df$target, levels=rev(levels(df$target)))
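A minimal sketch of what that reversal does, using a small made-up factor (the toy values and the names target and target_rev are assumptions for illustration, not the question's data). The first level is the one that ends up being treated as positive, and rev() swaps it without touching the observations:

```r
# Toy outcome; values are illustrative only
target <- factor(c("automatic", "manual", "manual", "automatic"),
                 levels = c("automatic", "manual"))
first_before <- levels(target)[1]

# Reverse the level order; the underlying observations stay the same
target_rev <- factor(target, levels = rev(levels(target)))
first_after <- levels(target_rev)[1]

# The labels attached to each observation are unchanged
same_labels <- identical(as.character(target), as.character(target_rev))
```

Only the level order changes, so anything that picks levels(x)[1] as the positive class now picks the other label.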
Answer 2:
The issue lies not in train() but in twoClassSummary(), which looks like this:
function (data, lev = NULL, model = NULL)
{
    lvls <- levels(data$obs)
    [...]
    out <- c(rocAUC,
             sensitivity(data[, "pred"], data[, "obs"],
                         lev[1]), # Hard coded positive class
             specificity(data[, "pred"], data[, "obs"],
                         lev[2])) # Hard coded negative class
    names(out) <- c("ROC", "Sens", "Spec")
    out
}
The order in which the levels are passed to sensitivity() and specificity() is hard-coded here.
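To see why the level order matters, here is a small sketch that calls caret's sensitivity() and specificity() directly on made-up predictions (the vectors obs and pred are assumptions for illustration only). Leaving positive at its default uses the first level, while naming it explicitly gives a result that does not depend on level order:

```r
library(caret)

# Toy observations and predictions; values are illustrative only
obs  <- factor(c("manual", "manual", "manual", "automatic", "automatic"),
               levels = c("automatic", "manual"))
pred <- factor(c("manual", "manual", "automatic", "automatic", "automatic"),
               levels = c("automatic", "manual"))

# Default: the first level ("automatic") is treated as the positive class
sens_default <- sensitivity(pred, obs)

# Explicit classes: independent of how the levels are ordered
sens_manual <- sensitivity(pred, obs, positive = "manual")
spec_auto   <- specificity(pred, obs, negative = "automatic")
```

With "manual" as the positive class, two of the three manual cars are predicted correctly, so sens_manual differs from sens_default even though the data are identical.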
As @Seymour points out very correctly, reversing the order of the levels of the outcome variable fixes the issue.
df$target <- factor(df$target, levels=rev(levels(df$target)))
If you are not willing to change the order of the levels, there is a non-intrusive way to change the twoClassSummary() function.
sensitivity() and specificity() take the positive and the negative level name, respectively (a suboptimal design choice). So we add these two arguments to our custom function.
Further down, we pass them to the respective functions to fix the problem.
# Defaults reproduce twoClassSummary(): first level positive, second level negative
customTwoClassSummary <- function(data, lev = NULL, model = NULL,
                                  positive = lev[1], negative = lev[2])
{
    lvls <- levels(data$obs)
    if (length(lvls) > 2)
        stop(paste("Your outcome has", length(lvls), "levels. The twoClassSummary() function isn't appropriate."))
    caret:::requireNamespaceQuietStop("ModelMetrics")
    if (!all(levels(data[, "pred"]) == lvls))
        stop("levels of observed and predicted data do not match")
    # The AUC calculation is left exactly as in twoClassSummary()
    rocAUC <- ModelMetrics::auc(ifelse(data$obs == lev[2], 0, 1), data[, lvls[1]])
    out <- c(rocAUC,
             # Only change happens here!
             sensitivity(data[, "pred"], data[, "obs"], positive = positive),
             specificity(data[, "pred"], data[, "obs"], negative = negative))
    names(out) <- c("ROC", "Sens", "Spec")
    out
}
But how do we specify these options without changing more code within the package? By default, caret doesn't pass extra options to the summary function. We wrap the function in an anonymous function in the call to trainControl():
ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE,
                     # Trick: fix some arguments by wrapping the call in an anonymous function
                     summaryFunction = function(...) customTwoClassSummary(...,
                                                                           positive = "manual", negative = "automatic"),
                     classProbs = TRUE)
The ... argument makes sure that all other arguments that caret passes to the anonymous function get passed on to customTwoClassSummary().
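The forwarding trick can be tried in isolation with base R. In this sketch, summarise_demo() is a hypothetical stand-in for a summary function (it is not part of caret), and fixed() plays the role of the anonymous wrapper:

```r
# Hypothetical stand-in for a summary function that takes an extra argument
summarise_demo <- function(data, lev = NULL, model = NULL, positive = NULL) {
  paste("rows:", length(data), "| positive:", positive)
}

# Fix `positive` up front; everything else is forwarded through `...`
fixed <- function(...) summarise_demo(..., positive = "manual")

# The caller only supplies the three arguments it already knows about
result <- fixed(1:5, lev = c("automatic", "manual"), model = "glm")
```

The caller never has to know about positive, which is why this works without touching any code inside the package.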
Source: https://stackoverflow.com/questions/45333029/specifying-positive-class-of-an-outcome-variable-in-caret-train