How should we interpret the results of the H2O predict function?

Question

I have trained and stored a random forest binary classification model. Now I'm trying to simulate processing new (out-of-sample) data with this model. My Python (Anaconda 3.6) code is:

import h2o
import pandas as pd
import sys

localH2O = h2o.init(ip = "localhost", port = 54321, max_mem_size = "8G", nthreads = -1)
h2o.remove_all()

model_path = "C:/sm/BottleRockets/rf_model/DRF_model_python_1501621766843_28117";
model = h2o.load_model(model_path)

new_data = h2o.import_file(path="C:/sm/BottleRockets/new_data.csv")
print(new_data.head(10))

predict = model.predict(new_data)  # predict returns a data frame
print(predict.describe())
predicted = predict[0,0]
probability = predict[0,2]  # probability the prediction is a "1"

print('prediction: ', predicted, ', probability: ', probability)

When I run this code I get:

>>> import h2o
>>> import pandas as pd
>>> import sys
>>> localH2O = h2o.init(ip = "localhost", port = 54321, max_mem_size = "8G", nthreads = -1)
Checking whether there is an H2O instance running at http://localhost:54321. connected.
--------------------------  ------------------------------
H2O cluster uptime:         22 hours 22 mins
H2O cluster version:        3.10.5.4
H2O cluster version age:    18 days
H2O cluster name:           H2O_from_python_Charles_0fqq0c
H2O cluster total nodes:    1
H2O cluster free memory:    6.790 Gb
H2O cluster total cores:    8
H2O cluster allowed cores:  8
H2O cluster status:         locked, healthy
H2O connection url:         http://localhost:54321
H2O connection proxy:
H2O internal security:      False
Python version:             3.6.1 final
--------------------------  ------------------------------
>>> h2o.remove_all()
>>> model_path = "C:/sm/BottleRockets/rf_model/DRF_model_python_1501621766843_28117";
>>> model = h2o.load_model(model_path)
>>> new_data = h2o.import_file(path="C:/sm/BottleRockets/new_data.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%
>>> print(new_data.head(10))
  BoxRatio    Thrust    Velocity    OnBalRun    vwapGain
----------  --------  ----------  ----------  ----------
     1.502    55.044        0.38          37       0.845

[1 row x 5 columns]

>>> predict = model.predict(new_data)  # predict returns a data frame

drf prediction progress: |████████████████████████████████████████████████| 100%
>>> print(predict.describe())
Rows:1
Cols:3


         predict    p0                  p1
-------  ---------  ------------------  -------------------
type     enum       real                real
mins                0.8849431818181818  0.11505681818181818
mean                0.8849431818181818  0.11505681818181818
maxs                0.8849431818181818  0.11505681818181818
sigma               0.0                 0.0
zeros               0                   0
missing  0          0                   0
0        1          0.8849431818181818  0.11505681818181818
None
>>> predicted = predict[0,0]
>>> probability = predict[0,2]  # probability the prediction is a "1"
>>> print('prediction: ', predicted, ', probability: ', probability)
prediction:  1 , probability:  0.11505681818181818
>>>

I am confused by the contents of the "predict" data frame. What do the numbers in the columns labeled "p0" and "p1" mean? I hope they are probabilities. As my code shows, I am trying to get the predicted classification (0 or 1) and the probability that this classification is correct. Does my code do that correctly?

Any comments will be greatly appreciated. Charles

Answer 1:

p0 is the probability (between 0 and 1) that class 0 is chosen.

p1 is the probability (between 0 and 1) that class 1 is chosen.
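
As a quick check against the output in the question: for a binary model the two columns are complementary, and the value to report next to a label is the column that matches that label. Here is a minimal sketch in plain Python using the single row from the question. The helper `probability_of_predicted_class` is a hypothetical name for illustration, not part of H2O.

```python
def probability_of_predicted_class(label, p0, p1):
    """Return the model's probability for whichever class was predicted."""
    return p1 if str(label) == "1" else p0

# Values copied from the single prediction row in the question.
label, p0, p1 = "1", 0.8849431818181818, 0.11505681818181818

# For a binary model the two class probabilities sum to 1 (up to rounding).
assert abs((p0 + p1) - 1.0) < 1e-9

prob = probability_of_predicted_class(label, p0, p1)
print("prediction:", label, "probability:", prob)
```

In the question's code this corresponds to reading column index 2 (p1) when the label is 1 and column index 1 (p0) when the label is 0, rather than always reading index 2.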

Keep in mind that the "prediction" is made by applying a threshold to p1: a row is labeled 1 when p1 reaches the threshold. Where the threshold sits depends on whether you want to reduce false positives or false negatives; it is not simply 0.5.

The threshold used for "the prediction" is the one that maximizes F1 (max-F1). That is why a row can be labeled 1 even though p1 is only about 0.115: the threshold is below that value. You can also extract p1 yourself and threshold it any way you like.
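
To illustrate thresholding p1 yourself, here is a minimal sketch in plain Python. The function name `apply_threshold` and the two threshold values are assumptions chosen for illustration; the model's actual max-F1 threshold is not shown in the question. The sketch also assumes a row is labeled 1 when p1 is at or above the threshold.

```python
def apply_threshold(p1_values, threshold):
    """Label a row 1 when its p1 is at or above the threshold, else 0."""
    return [1 if p1 >= threshold else 0 for p1 in p1_values]

# p1 from the question's single prediction row.
p1_values = [0.11505681818181818]

# Hypothetical thresholds, for illustration only.
labels_low = apply_threshold(p1_values, 0.10)   # threshold below p1
labels_half = apply_threshold(p1_values, 0.50)  # conventional 0.5 cut-off

print(labels_low, labels_half)
```

With the lower threshold the row is labeled 1, and with 0.5 it is labeled 0, which matches the behaviour described above. With a real model, the p1 values would come from the "p1" column of the frame returned by `model.predict`. If your H2O version provides it, `model.find_threshold_by_max_metric("f1")` should report the threshold the model uses, but check this against the documentation for your version.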

Answer 2:

Darren Cook asked me to post the first few lines of my training data. Here it is:

   BoxRatio  Thrust  Velocity  OnBalRun  vwapGain  Altitude
0     0.000   0.000     2.186     4.534     0.361         1
1     0.000   0.000     0.561     2.642     0.909         1
2     2.824   2.824     2.199     4.748     1.422         1
3     0.442   0.452     1.702     3.695     1.186         0
4     0.084   0.088     0.612     1.699     0.700         1

The response column is labeled "Altitude". Class 1 is what I want to see from new "out-of-sample" data. "1" is good, and it means that "Altitude" was reached (true positive). "0" means that "Altitude" was not reached (true negative). In the predict table above, "1" was predicted with a probability of 0.11505681818181818. This does not make sense to me.

Charles

Source: https://stackoverflow.com/questions/45523997/how-should-we-interpret-the-results-of-the-h2o-predict-function

Tags

python-3.x

h2o