Custom Performance Function in caret Package using predicted Probability

Submitted by 独自空忆成欢 on 2021-02-10 10:40:48

Question


This SO post is about using a custom performance measurement function in the caret package. You want to find the best prediction model, so you build several and compare them by calculating a single metric that is derived from comparing the observed and the predicted values. There are default functions that calculate such metrics, but you can also define your own metric function. This custom function must take the observed and predicted values as input.

In classification problems (let's say with only two classes) the predicted value is 0 or 1. However, I also need to evaluate the probability calculated by the model. Is there any way to achieve this?

The reason is that there are applications where you need to know whether a prediction of 1 was made with a 99% probability or with a 51% probability - not just whether the prediction is 1 or 0.

Can anyone help?


Edit: OK, so let me try to explain a little better. In the documentation of the caret package, under 5.5.5 (Alternate Performance Metrics), there is a description of how to use your own custom performance function, like so:

fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using 
                           ## the following function
                           summaryFunction = twoClassSummary)

twoClassSummary is the custom performance function in this example. The function provided here needs to take as input a data frame or matrix with columns obs and pred. And here's the point - I want to use a function that takes not the observed and predicted values, but the observed values and the predicted probabilities.
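
To make that concrete, here is a minimal sketch of the kind of metric I have in mind (a Brier score, used here purely as a hypothetical stand-in) - a function that compares observed classes against predicted probabilities instead of hard 0/1 predictions:

# minimal sketch (hypothetical): a metric that needs probabilities,
# not hard class predictions
brier <- function(obs, prob) {
  real <- as.numeric(obs) - 1   # 0/1 encoding of the two classes
  mean((real - prob)^2)         # Brier score: smaller is better
}

brier(factor(c("no", "yes", "yes")), c(0.2, 0.9, 0.6))
#> [1] 0.07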


One more thing:

Solutions from other packages are also welcome. The only thing I am not looking for is "This is how you write your own cross-validation function."


Answer 1:


Caret does support passing class probabilities to custom summary functions when you specify classProbs = TRUE in trainControl. In that case, the data argument passed to your custom summary function will have two additional columns, named after the classes, containing the probability of each class. The names of these classes are available in the lev argument, which is a vector of length 2.
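
A quick way to see this structure is a throwaway summary function (inspectSummary is a made-up name, for illustration only) that just prints what caret hands it:

# throwaway sketch: print what caret passes to a summary function.
# With classProbs = TRUE, 'data' contains obs, pred and one probability
# column per class level; the level names arrive in 'lev'.
inspectSummary <- function(data, lev = NULL, model = NULL) {
  print(colnames(data))  # e.g. "pred" "obs" "M" "R" ... for the Sonar data
  print(lev)             # e.g. "M" "R"
  c(Dummy = 0)           # a summary function must return a named numeric vector
}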

See the full example:

library(caret)
library(mlbench)
data(Sonar)

Custom LogLoss summary function:

LogLoss <- function(data, lev = NULL, model = NULL) {
  obs <- data[, "obs"] # truth
  cls <- levels(obs) # find the class names
  probs <- data[, cls[2]] # use the second class name to extract the probabilities for the 2nd class
  probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15) # bound the probabilities; this line and below are just the log-loss calculation, irrelevant for your question
  logPreds <- log(probs)
  log1Preds <- log(1 - probs)
  real <- (as.numeric(data$obs) - 1) # 0/1 encoding of the truth
  out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
  names(out) <- c("LogLoss") # important since this name is specified in the call to train; output can be a named vector of multiple values
  out
}
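
Before plugging it into train, you can sanity-check the function on a hand-made resample that mimics the data argument caret passes in (the values below are made up):

# toy resample mimicking caret's 'data' argument (hypothetical values)
toy <- data.frame(obs  = factor(c("M", "R", "R"), levels = c("M", "R")),
                  pred = factor(c("M", "R", "M"), levels = c("M", "R")),
                  M = c(0.9, 0.2, 0.6),
                  R = c(0.1, 0.8, 0.4))
LogLoss(toy)
#>   LogLoss 
#> 0.4149316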

fitControl <- trainControl(method = "cv",
                           number = 5,
                           classProbs = TRUE,
                           summaryFunction = LogLoss)


fit <- train(Class ~ .,
             data = Sonar,
             method = "rpart",
             metric = "LogLoss",
             tuneLength = 5,
             trControl = fitControl,
             maximize = FALSE) # important: set according to whether your metric should be maximized or minimized

fit
#output
CART 

208 samples
 60 predictor
  2 classes: 'M', 'R' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 166, 166, 166, 167, 167 
Resampling results across tuning parameters:

  cp          LogLoss  
  0.00000000  1.1220902
  0.01030928  1.1220902
  0.05154639  1.1017268
  0.06701031  1.0694052
  0.48453608  0.6405134

LogLoss was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.4845361.
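
Once trained, the per-class probabilities are also available from the fitted model; the probability columns are named after the class levels (shown here on the training data just for illustration):

head(predict(fit, newdata = Sonar, type = "prob"))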

Alternatively, use the lev argument, which contains the class levels, and add some error checking:

LogLoss <- function(data, lev = NULL, model = NULL) {
  if (length(lev) > 2) {
    stop(paste("Your outcome has", length(lev), "levels. The LogLoss() function isn't appropriate."))
  }
  obs <- data[, "obs"] # truth
  probs <- data[, lev[2]] # use the second class name
  probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15) # bound the probabilities
  logPreds <- log(probs)
  log1Preds <- log(1 - probs)
  real <- (as.numeric(data$obs) - 1)
  out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
  names(out) <- c("LogLoss")
  out
}
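
The guard can be exercised directly, without running train; the empty data frame is just a placeholder, since the check fires before data is touched:

LogLoss(data.frame(), lev = c("a", "b", "c"))
#> Error in LogLoss(data.frame(), lev = c("a", "b", "c")) : 
#>   Your outcome has 3 levels. The LogLoss() function isn't appropriate.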

Check out this section of the caret book for additional info: https://topepo.github.io/caret/model-training-and-tuning.html#metrics. It's a great book to read if you plan on using caret - and even if you don't, it's a good read.




Answer 2:


Sadly, I just found the answer to my question. There is this one little sentence in the caret documentation...

"...If none of these parameters are satisfactory, the user can also compute custom performance metrics. The trainControl function has a argument called summaryFunction that specifies a function for computing performance. The function should have these arguments:

data is a reference for a data frame or matrix with columns called obs and pred for the observed and predicted outcome values (either numeric data for regression or character values for classification). Currently, class probabilities are not passed to the function. The values in data are the held-out predictions (and their associated reference values) for a single combination of tuning..."

For the sake of documentation: This is written on 2020-07-03 with caret package documentation from 2019-03-27.




Answer 3:


I am not really sure I understand your question correctly:

To receive predicted probabilities from a model mdl, you can use predict(mdl, type = "prob"). For example:

library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2

df <- iris
df$isSetosa <- factor(df$Species == "setosa", levels = c(FALSE, TRUE), labels = c("not-setosa", "is-setosa"))
df$Species <- NULL

mdl <- train(isSetosa ~ ., data = df, method = "glm",
             family = "binomial",
             trControl = trainControl(method = "cv"))

preds <- predict(mdl, newdata = df, type = "prob")
head(preds)
#>     not-setosa is-setosa
#> 1 2.220446e-16         1
#> 2 2.220446e-16         1
#> 3 2.220446e-16         1
#> 4 1.875722e-12         1
#> 5 2.220446e-16         1
#> 6 2.220446e-16         1

Created on 2020-07-02 by the reprex package (v0.3.0)

That is, we see that case 4 is predicted to be a setosa with ~100% probability (tbh, this toy model is way too good to be true)...

Now we can create a custom function that collapses the values to a single metric.

true <- df$isSetosa

# a very basic metric that just sums the absolute differences between
# the one-hot encoded truth and the predicted probabilities
custom_model_metric <- function(preds, true) {
  d <- data.frame(true = true)
  tt <- predict(dummyVars(~true, d), d)
  colnames(tt) <- c("not-setosa", "is-setosa")
  
  sum(abs(tt - preds))
}

custom_model_metric(preds, true)
#> [1] 3.294029e-09

Created on 2020-07-02 by the reprex package (v0.3.0)
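
If you wanted to optimize such a metric during resampling rather than compute it after the fact, here is a sketch (absDiffSummary is a made-up name, assuming the same two-class setup) that wraps the idea into caret's summaryFunction signature:

# sketch: the same idea in summaryFunction form (hypothetical name).
# With classProbs = TRUE, the probability of each class sits in a
# column named after that class level.
absDiffSummary <- function(data, lev = NULL, model = NULL) {
  real <- as.numeric(data$obs == lev[2])        # 1 for the second level, else 0
  c(AbsDiff = sum(abs(real - data[, lev[2]])))  # total absolute probability error
}

ctrl <- trainControl(method = "cv", classProbs = TRUE,
                     summaryFunction = absDiffSummary)
# then: train(..., metric = "AbsDiff", maximize = FALSE, trControl = ctrl)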



Source: https://stackoverflow.com/questions/62658672/custom-performance-function-in-caret-package-using-predicted-probability
