Question
I'm using the h2o package (v 3.6.0) in R, and I've built a grid search model. Now I'm trying to access the model which minimizes MSE on the validation set. In Python's sklearn, this is easily achievable when using RandomizedSearchCV:
## Pseudo code:
grid = RandomizedSearchCV(model, params, n_iter = 5)
grid.fit(X)
best = grid.best_estimator_
Unfortunately, this is not as straightforward in h2o. Here's an example you can recreate:
library(h2o)
## assume you got h2o initialized...
X <- as.h2o(iris[1:100,]) # Note: only using top two classes for example
grid <- h2o.grid(
  algorithm = 'gbm',
  x = names(X[,1:4]),
  y = 'Species',
  training_frame = X,
  hyper_params = list(
    distribution = 'bernoulli',
    ntrees = c(25, 50)
  )
)
Viewing grid prints a wealth of information, including this portion:
> grid
ntrees distribution status_ok model_ids
50 bernoulli OK Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_16_model_1
25 bernoulli OK Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_16_model_0
With a bit of digging, you can access each individual model and view every metric imaginable:
> h2o.getModel(grid@model_ids[[1]])
H2OBinomialModel: gbm
Model ID: Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_18_model_1
Model Summary:
number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
1 50 4387 1 1 1.00000 2 2 2.00000
H2OBinomialMetrics: gbm
** Reported on training data. **
MSE: 1.056927e-05
R^2: 0.9999577
LogLoss: 0.003256338
AUC: 1
Gini: 1
Confusion Matrix for F1-optimal threshold:
setosa versicolor Error Rate
setosa 50 0 0.000000 =0/50
versicolor 0 50 0.000000 =0/50
Totals 50 50 0.000000 =0/100
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.996749 1.000000 0
2 max f2 0.996749 1.000000 0
3 max f0point5 0.996749 1.000000 0
4 max accuracy 0.996749 1.000000 0
5 max precision 0.996749 1.000000 0
6 max absolute_MCC 0.996749 1.000000 0
7 max min_per_class_accuracy 0.996749 1.000000 0
And with a lot of digging, you can finally get to this:
> h2o.getModel(grid@model_ids[[1]])@model$training_metrics@metrics$MSE
[1] 1.056927e-05
This seems like a lot of kludgey work to get down to a metric that ought to be top-level for model selection (yes, I'm now interjecting my opinions...). In my situation, I've got a grid with hundreds of models, and my current, hacky solution just doesn't seem very "R-esque":
model_select_ <- function(grid) {
  model_ids <- grid@model_ids
  min_mse <- Inf
  best_model <- NULL
  # Walk every model in the grid and keep the one with the lowest training MSE
  for (model_id in model_ids) {
    model <- h2o.getModel(model_id)
    mse <- model@model$training_metrics@metrics$MSE
    if (mse < min_mse) {
      min_mse <- mse
      best_model <- model
    }
  }
  best_model
}
This is so utilitarian for something that is so core to the practice of machine learning, and it just strikes me as odd that h2o would not have a "cleaner" method of extracting the optimal model, or at least model metrics.
Am I missing something? Is there no "out of the box" method for selecting the best model?
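For completeness, here is a slightly more compact variant of the same workaround, a sketch using base R's sapply and the h2o.mse accessor rather than the raw S4 slots; it's still nothing h2o provides out of the box:
## Sketch: pull the training MSE from every model in the grid, then index the minimum
mse_by_model <- sapply(grid@model_ids, function(id) {
  h2o.mse(h2o.getModel(id), train = TRUE)
})
best_model <- h2o.getModel(grid@model_ids[[which.min(mse_by_model)]])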
Answer 1:
Yes, there is an easy way to extract the "top" model of an H2O grid search. There are also utility functions (e.g. h2o.mse) that will extract the model metrics you have been trying to access. Examples of how to do these things can be found in the h2o-r/demos and h2o-py/demos subfolders of the h2o-3 GitHub repo.
Since you are using R, here is a relevant code example that includes a grid search with sorted results. You can also find how to access this information in the R documentation for the h2o.getGrid function.
Print out the auc for all of the models, sorted by validation AUC:
auc_table <- h2o.getGrid(grid_id = "eeg_demo_gbm_grid", sort_by = "auc", decreasing = TRUE)
print(auc_table)
Here is an example of the output:
H2O Grid Details
================
Grid ID: eeg_demo_gbm_grid
Used hyper parameters:
- ntrees
- max_depth
- learn_rate
Number of models: 18
Number of failed models: 0
Hyper-Parameter Search Summary: ordered by decreasing auc
ntrees max_depth learn_rate model_ids auc
1 100 5 0.2 eeg_demo_gbm_grid_model_17 0.967771493797284
2 50 5 0.2 eeg_demo_gbm_grid_model_16 0.949609591795923
3 100 5 0.1 eeg_demo_gbm_grid_model_8 0.94941792664595
4 50 5 0.1 eeg_demo_gbm_grid_model_7 0.922075196552274
5 100 3 0.2 eeg_demo_gbm_grid_model_14 0.913785959685157
6 50 3 0.2 eeg_demo_gbm_grid_model_13 0.887706691652792
7 100 3 0.1 eeg_demo_gbm_grid_model_5 0.884064379717198
8 5 5 0.2 eeg_demo_gbm_grid_model_15 0.851187402678818
9 50 3 0.1 eeg_demo_gbm_grid_model_4 0.848921799270639
10 5 5 0.1 eeg_demo_gbm_grid_model_6 0.825662907513139
11 100 2 0.2 eeg_demo_gbm_grid_model_11 0.812030639460551
12 50 2 0.2 eeg_demo_gbm_grid_model_10 0.785379521713437
13 100 2 0.1 eeg_demo_gbm_grid_model_2 0.78299280750123
14 5 3 0.2 eeg_demo_gbm_grid_model_12 0.774673686150002
15 50 2 0.1 eeg_demo_gbm_grid_model_1 0.754834657912535
16 5 3 0.1 eeg_demo_gbm_grid_model_3 0.749285131682721
17 5 2 0.2 eeg_demo_gbm_grid_model_9 0.692702793188135
18 5 2 0.1 eeg_demo_gbm_grid_model_0 0.676144542037133
The top row in the table contains the model with the best AUC, so below we can grab that model and extract the validation AUC:
best_model <- h2o.getModel(auc_table@model_ids[[1]])
h2o.auc(best_model, valid = TRUE)
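Since the original question was about MSE rather than AUC, the same accessor pattern applies; a small sketch, assuming best_model was retrieved as above:
h2o.mse(best_model, valid = TRUE)   # MSE on the validation frame
h2o.auc(best_model, train = TRUE)   # training AUC, for comparison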
In order for the h2o.getGrid function to be able to sort by a metric on the validation set, you need to actually pass the h2o.grid function a validation_frame. In your example above, you did not pass a validation_frame, so you can't evaluate the models in the grid on the validation set.
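As a minimal sketch of what that could look like (the split ratio and grid_id below are made up for illustration; the rest follows the standard h2o.splitFrame and h2o.grid signatures):
## Split the data so there is a separate validation frame to score against
splits <- h2o.splitFrame(X, ratios = 0.8, seed = 1)
train <- splits[[1]]
valid <- splits[[2]]

h2o.grid(
  algorithm = 'gbm',
  grid_id = 'gbm_grid_with_valid',   # hypothetical grid id
  x = names(X[,1:4]),
  y = 'Species',
  training_frame = train,
  validation_frame = valid,
  hyper_params = list(ntrees = c(25, 50))
)

## Now the grid can be sorted by a metric computed on the validation frame
sorted_grid <- h2o.getGrid(grid_id = 'gbm_grid_with_valid', sort_by = 'auc', decreasing = TRUE)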
Answer 2:
This seems to be valid for recent versions of h2o only; with 3.8.2.3 you get a Java exception saying that "auc" is an invalid metric. The following fails:
library(h2o)
library(jsonlite)
h2o.init()
iris.hex <- as.h2o(iris)
h2o.grid("gbm", grid_id = "gbm_grid_id", x = c(1:4), y = 5,
training_frame = iris.hex, hyper_params = list(ntrees = c(1,2,3)))
grid <- h2o.getGrid("gbm_grid_id", sort_by = "auc", decreasing = T)
However, replace 'auc' with 'logloss' and decreasing = FALSE, and it's fine.
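For example, this variant runs (sorting ascending, since lower logloss is better):
grid <- h2o.getGrid("gbm_grid_id", sort_by = "logloss", decreasing = FALSE)
best <- h2o.getModel(grid@model_ids[[1]])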
Answer 3:
Unfortunately the H2O grid function uses the training_frame, not the validation_frame, when you pass both in. Consequently the winning model is extremely overfitted and useless. EDIT: Well, correction here, it's actually useful to have training bias very low like this, for purposes of learning curve analysis and bias versus variance analysis. But to be clear, I also need to be able to run it again with a validation dataset used as the search criterion for final model fitting and selection.
For example, here is a winning model from the grid function on a GBM, where validation_frame was passed in and AUC was the search metric. You can see that the validation_auc starts at 0.5 and actually worsens to 0.44 over the scoring history of the winning model:
Scoring History:
timestamp duration number_of_trees training_rmse
1 2017-02-06 10:09:19 6 min 13.153 sec 0 0.70436
2 2017-02-06 10:09:23 6 min 16.863 sec 100 0.70392
3 2017-02-06 10:09:27 6 min 20.950 sec 200 0.70343
4 2017-02-06 10:09:31 6 min 24.806 sec 300 0.70289
5 2017-02-06 10:09:35 6 min 29.244 sec 400 0.70232
6 2017-02-06 10:09:39 6 min 33.069 sec 500 0.70171
7 2017-02-06 10:09:43 6 min 37.243 sec 600 0.70107
training_logloss training_auc training_lift training_classification_error
1 2.77317 0.50000 1.00000 0.49997
2 2.69896 0.99980 99.42857 0.00026
3 2.62768 0.99980 99.42857 0.00020
4 2.55902 0.99982 99.42857 0.00020
5 2.49675 0.99993 99.42857 0.00020
6 2.43712 0.99994 99.42857 0.00020
7 2.38071 0.99994 99.42857 0.00013
validation_rmse validation_logloss validation_auc validation_lift
1 0.06921 0.03058 0.50000 1.00000
2 0.06921 0.03068 0.45944 9.03557
3 0.06922 0.03085 0.46685 9.03557
4 0.06922 0.03107 0.46817 9.03557
5 0.06923 0.03133 0.45656 9.03557
6 0.06924 0.03163 0.44947 9.03557
7 0.06924 0.03192 0.44400 9.03557
validation_classification_error
1 0.99519
2 0.00437
3 0.00656
4 0.00656
5 0.00700
6 0.00962
7 0.00962
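Until that behavior changes, one workaround (a sketch, assuming the grid object and a validation frame already exist from an earlier h2o.grid call) is to rank the models yourself on the validation metric:
## Compute validation AUC for every model in the grid and order the ids manually
val_auc <- sapply(grid@model_ids, function(id) h2o.auc(h2o.getModel(id), valid = TRUE))
ranked_ids <- grid@model_ids[order(val_auc, decreasing = TRUE)]
best_by_valid <- h2o.getModel(ranked_ids[[1]])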
Source: https://stackoverflow.com/questions/35657989/h2o-r-api-retrieving-optimal-model-from-grid-search