GLM model: h2o.predict gives very different results depending on number of rows used in the validation data

Question

I built an H2O (v3.14) GLM model. However, when I check the predictions with h2o.predict, I get very different results depending on how many rows of the validation set I use.

Calling h2o.predict on the first 10 rows, I get:

# Predict using the first 10 lines in validation set
h2o.predict(glm.test, df.valid[1:10,])
# Result:
  predict        p0           p1
1       0 0.9999224 7.756014e-05
2       0 0.9962711 3.728930e-03
3       0 0.9997378 2.622195e-04
4       0 0.9999556 4.437544e-05
5       0 0.9998994 1.006037e-04
6       0 0.9999394 6.062479e-05

But if I call h2o.predict on the first 100 rows, I get very different results for the same leading rows:

h2o.predict(glm.test, df.valid[1:100,])
# Result:
  predict         p0        p1
1       1 0.06196439 0.9380356
2       1 0.15371122 0.8462888
3       1 0.01654756 0.9834524
4       1 0.12830090 0.8716991
5       1 0.07195659 0.9280434
6       1 0.09725532 0.9027447

The code below reproduces the problem. The data set (which is very sparse) can be downloaded from https://www.dropbox.com/s/58ul6zrekpmjh20/dt.truth.csv.gz

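library(h2o)
h2o.init()        # start or connect to a local H2O cluster
h2o.removeAll()   # clear any frames and models left over from earlier runs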

# Note: The zipped data file can be downloaded from:
#       https://www.dropbox.com/s/58ul6zrekpmjh20/dt.truth.csv.gz

df.truth <- h2o.importFile(
  path = "data/dt.truth.csv.gz", sep = ",", header = TRUE)

# Treat the response as categorical so the GLM is fit as a classifier
df.truth$isTarget <- h2o.asfactor(df.truth$isTarget)

# Split into train (70%) / validation (30%)
splits <- h2o.splitFrame(df.truth, ratios = 0.7, seed = 1234)
df.train <- h2o.assign(splits[[1]], "df.train.hex")
df.valid <- h2o.assign(splits[[2]], "df.valid.hex")

# Build a GLM model (all columns other than isTarget are used as predictors)
glm.test <- h2o.glm(
  training_frame = df.train,
  y = "isTarget",
  family = "binomial",
  missing_values_handling = "MeanImputation",
  seed = 1000000)

# Predict using the first 10 lines in validation set
h2o.predict(glm.test, df.valid[1:10,])

# Predict using the first 100 lines in validation set. Gives very different results!
h2o.predict(glm.test, df.valid[1:100,])
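
To put a number on the discrepancy, the two prediction tables can be compared row by row. The following is only a minimal sketch: compare_p1 is a hypothetical helper (it is not part of the h2o package), and the sample values are the six p1 entries printed above, copied by hand so that the snippet runs without an H2O cluster.

```r
# Hypothetical helper (not part of h2o): compare the p1 column of two
# prediction tables over the rows they have in common.
compare_p1 <- function(pred_small, pred_large) {
  n <- min(nrow(pred_small), nrow(pred_large))
  idx <- seq_len(n)
  data.frame(
    row      = idx,
    p1_small = pred_small$p1[idx],
    p1_large = pred_large$p1[idx],
    abs_diff = abs(pred_small$p1[idx] - pred_large$p1[idx])
  )
}

# p1 values printed for the 10-row call (only the first six rows are shown above)
pred10 <- data.frame(p1 = c(7.756014e-05, 3.728930e-03, 2.622195e-04,
                            4.437544e-05, 1.006037e-04, 6.062479e-05))

# p1 values printed for the 100-row call, same leading rows
pred100 <- data.frame(p1 = c(0.9380356, 0.8462888, 0.9834524,
                             0.8716991, 0.9280434, 0.9027447))

cmp <- compare_p1(pred10, pred100)
print(cmp)
```

Against a live cluster, the same helper could presumably be fed as.data.frame(h2o.predict(glm.test, df.valid[1:10,])) and as.data.frame(h2o.predict(glm.test, df.valid[1:100,])) instead of the hand-copied values. Since both calls score the same leading rows with the same model, I would expect abs_diff to be close to zero.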

Source: https://stackoverflow.com/questions/47404817/glm-model-h2o-predict-gives-very-different-results-depending-on-number-of-rows

Tags

h2o