Calculate AUC in R?

Asked 2020-12-07 09:45 by 感动是毒 · 10 answers · 1193 views

Given a vector of scores and a vector of actual class labels, how do you calculate a single-number AUC metric for a binary classifier in the R language or in simple English?

10 Answers
  • 2020-12-07 10:07

    As mentioned by others, you can compute the AUC using the ROCR package. With the ROCR package you can also plot the ROC curve, lift curve and other model selection measures.

    You can compute the AUC directly without using any package by using the fact that the AUC is equal to the probability that a true positive is scored greater than a true negative.

    For example, if pos.scores is a vector containing the scores of the positive examples, and neg.scores is a vector containing the scores of the negative examples, then the AUC is approximated by:

    > mean(sample(pos.scores,1000,replace=T) > sample(neg.scores,1000,replace=T))
    [1] 0.7261
    

    This gives an approximation of the AUC (the exact value will vary with the random sample). You can also estimate the variance of the AUC by bootstrapping:

    > aucs = replicate(1000,mean(sample(pos.scores,1000,replace=T) > sample(neg.scores,1000,replace=T)))
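    For small datasets you can compute this probability exactly instead of sampling, by comparing every positive/negative pair (counting ties as half). The scores here are made up for illustration:

    pos.scores <- c(0.9, 0.8, 0.6, 0.4)
    neg.scores <- c(0.7, 0.5, 0.3)
    # fraction of pairs where the positive outscores the negative,
    # plus half the fraction of tied pairs
    mean(outer(pos.scores, neg.scores, ">") +
         0.5 * outer(pos.scores, neg.scores, "=="))
    # [1] 0.75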
    
  • 2020-12-07 10:08

    The currently top-voted answer is incorrect, because it disregards ties. When a positive and a negative score are equal, the pair should contribute 0.5 to the AUC. Below is a corrected version:

    computeAUC <- function(pos.scores, neg.scores, n_sample=100000) {
      # Args:
      #   pos.scores: scores of positive observations
      #   neg.scores: scores of negative observations
      #   n_sample  : number of samples used to approximate the AUC
    
      pos.sample <- sample(pos.scores, n_sample, replace=T)
      neg.sample <- sample(neg.scores, n_sample, replace=T)
      mean(1.0*(pos.sample > neg.sample) + 0.5*(pos.sample==neg.sample))
    }
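
    A quick sanity check with made-up scores that include a tie: the exact pairwise AUC for the data below is 5/6 ≈ 0.833, so the sampled estimate should land close to that:

    set.seed(1)
    computeAUC(pos.scores = c(0.9, 0.5, 0.5),
               neg.scores = c(0.5, 0.2))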
    
  • 2020-12-07 10:12

    With the pROC package you can use the auc() function, as in this example from its help page:

    > data(aSAH)
    > 
    > # Syntax (response, predictor):
    > auc(aSAH$outcome, aSAH$s100b)
    Area under the curve: 0.7314
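
    auc() also accepts plain vectors, so you can pass your own labels and scores directly (toy data below; note that pROC handles tied scores correctly):

    library(pROC)
    labels <- c(0, 0, 1, 1)
    scores <- c(0.1, 0.3, 0.3, 0.9)
    auc(labels, scores)
    # Area under the curve: 0.875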
    
  • 2020-12-07 10:12

    I found some of the solutions here to be slow and/or confusing (and some of them don't handle ties correctly), so I wrote my own data.table-based function, auc_roc(), in my R package mltools.

    library(data.table)
    library(mltools)
    
    preds <- c(.1, .3, .3, .9)
    actuals <- c(0, 0, 1, 1)
    
    auc_roc(preds, actuals)  # 0.875
    
    auc_roc(preds, actuals, returnDT=TRUE)
       Pred CountFalse CountTrue CumulativeFPR CumulativeTPR AdditionalArea CumulativeArea
    1:  0.9          0         1           0.0           0.5          0.000          0.000
    2:  0.3          1         1           0.5           1.0          0.375          0.375
    3:  0.1          1         0           1.0           1.0          0.500          0.875
    
  • 2020-12-07 10:16

    You can learn more about AUROC in this blog post by Miron Kursa:

    https://mbq.me/blog/augh-roc/

    He provides a fast function for AUROC:

    # By Miron Kursa https://mbq.me
    auroc <- function(score, bool) {
      n1 <- sum(!bool)
      n2 <- sum(bool)
      U  <- sum(rank(score)[!bool]) - n1 * (n1 + 1) / 2
      return(1 - U / n1 / n2)
    }
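
    The formula works because sum(rank(score)[!bool]) - n1*(n1+1)/2 is the Mann-Whitney U statistic of the negative class, and AUC = 1 - U/(n1*n2); since rank() averages tied ranks by default, tied pairs automatically count as half. A quick hand check with toy data containing a tie:

    score <- c(0.1, 0.3, 0.3, 0.9)
    bool  <- c(FALSE, FALSE, TRUE, TRUE)
    auroc(score, bool)
    # [1] 0.875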
    

    Let's test it:

    set.seed(42)
    score <- rnorm(1e3)
    bool  <- sample(c(TRUE, FALSE), 1e3, replace = TRUE)
    
    pROC::auc(bool, score)
    mltools::auc_roc(score, bool)
    ROCR::performance(ROCR::prediction(score, bool), "auc")@y.values[[1]]
    auroc(score, bool)
    
    0.51371668847094
    0.51371668847094
    0.51371668847094
    0.51371668847094
    

    auroc() is roughly 100 times faster than pROC::auc() and computeAUC(), and roughly 10 times faster than mltools::auc_roc() and ROCR::performance():

    library(microbenchmark)
    print(microbenchmark(
      pROC::auc(bool, score),
      computeAUC(score[bool], score[!bool]),
      mltools::auc_roc(score, bool),
      ROCR::performance(ROCR::prediction(score, bool), "auc")@y.values,
      auroc(score, bool)
    ))
    
    Unit: microseconds
                                                                 expr       min
                                               pROC::auc(bool, score) 21000.146
                                computeAUC(score[bool], score[!bool]) 11878.605
                                        mltools::auc_roc(score, bool)  5750.651
     ROCR::performance(ROCR::prediction(score, bool), "auc")@y.values  2899.573
                                                   auroc(score, bool)   236.531
             lq       mean     median        uq        max neval  cld
     22005.3350 23738.3447 22206.5730 22710.853  32628.347   100    d
     12323.0305 16173.0645 12378.5540 12624.981 233701.511   100   c 
      6186.0245  6495.5158  6325.3955  6573.993  14698.244   100  b  
      3019.6310  3300.1961  3068.0240  3237.534  11995.667   100 ab  
       245.4755   253.1109   251.8505   257.578    300.506   100 a   
    
  • 2020-12-07 10:17

    The ROCR package will calculate the AUC among other statistics:

    library(ROCR)
    # pred is a prediction object built from your scores and class labels:
    pred <- prediction(scores, labels)
    auc.tmp <- performance(pred, "auc")
    auc <- as.numeric(auc.tmp@y.values)
    