How to directly plot ROC of h2o model object in R

灰色年华 2021-01-06 00:33

My apologies if I'm missing something obvious. I've been thoroughly enjoying working with h2o over the last few days using the R interface. I would like to evaluate my model, sa

4 Answers
  • 2021-01-06 01:01

    There is currently no function in the H2O R or Python client to plot the ROC curve directly. The roc method in Python returns the data necessary to plot the ROC curve, but does not plot the curve itself. Plotting the ROC curve directly from R and Python seems like a useful thing to add, so I've created a JIRA ticket for it here: https://0xdata.atlassian.net/browse/PUBDEV-4449

    The reference to the ROC curve in the docs refers to the H2O Flow GUI, which will automatically plot a ROC curve for any binary classification model in your H2O cluster. All the other items in that list are in fact available directly in R and Python, however.

    If you train a model in R, you can visit the Flow interface (e.g. localhost:54321) and click on a binomial model to see its ROC curves (training, validation and cross-validated versions). It will look like this:

  • 2021-01-06 01:19

    Building off @Lauren's example, after you run h2o.performance you can extract all the information ggplot needs from perf@metrics$thresholds_and_metric_scores. This code produces the ROC curve, but you can also add precision and recall to the selected variables to plot the PR curve.

    Here is some example code using the same model as in @Lauren's answer.

    library(h2o)
    library(dplyr)
    library(ggplot2)
    
    h2o.init()
    
    # Run GLM of CAPSULE ~ AGE + RACE + PSA + DCAPS
    prostatePath <- system.file("extdata", "prostate.csv", package = "h2o")
    prostate.hex <- h2o.importFile(
        path = prostatePath, 
        destination_frame = "prostate.hex"
        )
    glm <- h2o.glm(
        y = "CAPSULE",
        x = c("AGE", "RACE", "PSA", "DCAPS"), 
        training_frame = prostate.hex, 
        family = "binomial", 
        nfolds = 0, 
        alpha = 0.5, 
        lambda_search = FALSE
    )
    
    # Model performance
    perf <- h2o.performance(glm, newdata = prostate.hex)
    
    # Extract info for ROC curve
    curve_dat <- data.frame(perf@metrics$thresholds_and_metric_scores) %>%
        select(c(tpr, fpr))
    
    # Plot ROC curve
    ggplot(curve_dat, aes(x = fpr, y = tpr)) +
        geom_point() +
        geom_line() +
        geom_segment(
            aes(x = 0, y = 0, xend = 1, yend = 1),
            linetype = "dotted",
            color = "grey50"
            ) +
        xlab("False Positive Rate") +
        ylab("True Positive Rate") +
        ggtitle("ROC Curve") +
        theme_bw()
    

    Which produces this plot:

    [Figure: ROC curve plot]
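
    The PR-curve variant mentioned above can be sketched in the same way. This is a minimal illustration, assuming the thresholds table exposes precision and recall columns as the answer implies. The scores data frame is a made-up stand-in for data.frame(perf@metrics$thresholds_and_metric_scores), so the snippet runs without an H2O cluster.

```r
# Hypothetical stand-in for data.frame(perf@metrics$thresholds_and_metric_scores);
# the values below are invented for illustration only.
scores <- data.frame(
  threshold = c(0.9, 0.7, 0.5, 0.3),
  precision = c(1.00, 0.80, 0.70, 0.55),
  recall    = c(0.20, 0.50, 0.80, 1.00),
  tpr       = c(0.20, 0.50, 0.80, 1.00),
  fpr       = c(0.00, 0.10, 0.30, 0.70)
)

# Keep only the PR-curve columns, ordered by recall so that a
# line plot connects the points from left to right.
pr_dat <- scores[order(scores$recall), c("recall", "precision")]
```

    pr_dat can then go through the same ggplot call as the ROC data, with aes(x = recall, y = precision) and matching axis labels.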

  • 2021-01-06 01:20

    You can get the ROC curve by passing the model performance metrics to H2O's plot function.

    Here is a shortened code snippet, which assumes you have created a model called glm and split your dataset into train and validation sets:

    perf <- h2o.performance(glm, newdata = validation)
    plot(perf, type = "roc")
    

    The full code snippet is below:

    library(h2o)
    h2o.init()
    
    # Run GLM of CAPSULE ~ AGE + RACE + PSA + DCAPS
    prostatePath = system.file("extdata", "prostate.csv", package = "h2o")
    prostate.hex = h2o.importFile(path = prostatePath, destination_frame = "prostate.hex")
    glm = h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS"), training_frame = prostate.hex, family = "binomial", nfolds = 0, alpha = 0.5, lambda_search = FALSE)
    
    # Compute performance metrics, then plot the ROC curve
    perf <- h2o.performance(glm, newdata = prostate.hex)
    plot(perf, type = "roc")
    

    This will produce the following plot:

  • 2021-01-06 01:22

    A naive solution is to use the generic plot() function on an H2OMetrics object:

    logit_fit <- h2o.glm(colnames(training)[-1], 'y',
        training_frame = training.hex, validation_frame = validation.hex,
        family = 'binomial')
    plot(h2o.performance(logit_fit, valid = TRUE), type = 'roc')
    

    This will give us a plot:

    But it is hard to customize, especially the line type, because H2O's plot method already uses the type parameter to select the 'roc' plot. I also have not found a way to draw several models' ROC curves on one plot this way. Instead, I extract the true positive rate and false positive rate from each H2OMetrics object and plot the ROC curves together with ggplot2. Here is the example code (it uses a lot of tidyverse syntax):

    library(h2o)
    library(tidyverse)

    # for example I have 4 H2OModels
    list(logit_fit,dt_fit,rf_fit,xgb_fit) %>% 
      # map a function to each element in the list
      map(function(x) x %>% h2o.performance(valid=TRUE) %>% 
            # from all these 'paths' in the object
            .@metrics %>% .$thresholds_and_metric_scores %>% 
            # extracting true positive rate and false positive rate
            .[c('tpr','fpr')] %>% 
            # add (0,0) at the start and (1,1) at the end of the ROC curve
            add_row(tpr=0,fpr=0,.before=1) %>% 
            add_row(tpr=1,fpr=1)) %>% 
      # add a column of model name for future grouping in ggplot2
      map2(c('Logistic Regression','Decision Tree','Random Forest','Gradient Boosting'),
            function(x,y) x %>% add_column(model=y)) %>% 
      # reduce four data.frame to one
      reduce(rbind) %>% 
      # plot fpr and tpr, map model to color as grouping
      ggplot(aes(fpr,tpr,col=model))+
      geom_line()+
      geom_segment(aes(x=0,y=0,xend = 1, yend = 1),linetype = 2,col='grey')+
      xlab('False Positive Rate')+
      ylab('True Positive Rate')+
      ggtitle('ROC Curve for Four Models')
    

    Then the ROC curve is:
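
    For readers who prefer to avoid the pipe chain, the same padding and stacking steps can be sketched in base R. This is only an illustration under stated assumptions: pad_roc and combine_roc are hypothetical helper names, and the curves list holds made-up stand-ins for each model's thresholds_and_metric_scores table rather than real H2O output.

```r
# Hypothetical stand-ins for each model's thresholds_and_metric_scores table;
# the numbers are invented for illustration only.
curves <- list(
  "Logistic Regression" = data.frame(threshold = c(0.8, 0.4),
                                     tpr = c(0.50, 0.90), fpr = c(0.10, 0.40)),
  "Random Forest"       = data.frame(threshold = c(0.7, 0.3),
                                     tpr = c(0.60, 0.95), fpr = c(0.05, 0.30))
)

# Keep tpr/fpr and add the (0,0) start point and (1,1) end point.
pad_roc <- function(df) {
  rbind(
    data.frame(tpr = 0, fpr = 0),
    df[c("tpr", "fpr")],
    data.frame(tpr = 1, fpr = 1)
  )
}

# Label each padded curve with its model name and stack them
# into one long data frame suitable for grouping in ggplot2.
combine_roc <- function(curves) {
  labelled <- Map(function(df, name) cbind(pad_roc(df), model = name),
                  curves, names(curves))
  combined <- do.call(rbind, unname(labelled))
  rownames(combined) <- NULL
  combined
}

roc_dat <- combine_roc(curves)
```

    roc_dat has one row per ROC point plus a model column, so it can be passed to ggplot(aes(fpr, tpr, col = model)) + geom_line() exactly like the result of the pipeline above.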
