Question
I built a GLM model with H2O (v. 3.14). However, when I check the predictions using h2o.predict, I get very different results depending on how many rows of the validation set I pass in.
Calling h2o.predict on the first 10 rows, I get:
# Predict using the first 10 rows of the validation set
h2o.predict(glm.test, df.valid[1:10,])
# Result:
predict p0 p1
1 0 0.9999224 7.756014e-05
2 0 0.9962711 3.728930e-03
3 0 0.9997378 2.622195e-04
4 0 0.9999556 4.437544e-05
5 0 0.9998994 1.006037e-04
6 0 0.9999394 6.062479e-05
But if I call h2o.predict on the first 100 rows, I get very different results for those same leading rows:
h2o.predict(glm.test, df.valid[1:100,])
# Result:
predict p0 p1
1 1 0.06196439 0.9380356
2 1 0.15371122 0.8462888
3 1 0.01654756 0.9834524
4 1 0.12830090 0.8716991
5 1 0.07195659 0.9280434
6 1 0.09725532 0.9027447
The code below reproduces the problem. The data set (which is very sparse) can be downloaded from https://www.dropbox.com/s/58ul6zrekpmjh20/dt.truth.csv.gz
library(h2o)
h2o.init()  # connect to (or start) a local H2O cluster
h2o.removeAll()
# Note: The zipped data file can be downloaded from:
# https://www.dropbox.com/s/58ul6zrekpmjh20/dt.truth.csv.gz
df.truth <- h2o.importFile(
  path = "data/dt.truth.csv.gz", sep = ",", header = TRUE)
df.truth$isTarget <- h2o.asfactor(df.truth$isTarget)
# Split into train / test
splits <- h2o.splitFrame(df.truth, c(0.7), seed=1234)
df.train <- h2o.assign(splits[[1]], "df.train.hex")
df.valid <- h2o.assign(splits[[2]], "df.valid.hex")
# Build a GLM model
glm.test <- h2o.glm(
  training_frame = df.train,
  y = "isTarget",
  family = "binomial",
  missing_values_handling = "MeanImputation",
  seed = 1000000)
# Predict using the first 10 rows of the validation set
h2o.predict(glm.test, df.valid[1:10,])
# Predict using the first 100 rows of the validation set. The results are very different!
h2o.predict(glm.test, df.valid[1:100,])
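To pin down exactly which rows disagree, it may help to score the whole validation frame once and compare its leading rows against the predictions from a slice. The sketch below is only a diagnostic aid and does not explain the cause. `compare_p1` is a hypothetical helper (not part of h2o) that works on plain data frames, and it assumes both tables list the same rows in the same order.

```r
# Hypothetical helper (not an h2o function): compare the p1 column of two
# prediction tables that are expected to describe the same leading rows.
compare_p1 <- function(pred_full, pred_subset, tol = 1e-8) {
  n <- nrow(pred_subset)
  stopifnot(n <= nrow(pred_full))
  full_p1 <- pred_full$p1[seq_len(n)]
  delta <- abs(full_p1 - pred_subset$p1)
  data.frame(
    row      = seq_len(n),
    full     = full_p1,
    subset   = pred_subset$p1,
    mismatch = delta > tol   # TRUE where the two predictions differ
  )
}

# Usage with the objects from the script above (needs a running H2O cluster):
# pred.all <- as.data.frame(h2o.predict(glm.test, df.valid))
# pred.10  <- as.data.frame(h2o.predict(glm.test, df.valid[1:10, ]))
# compare_p1(pred.all, pred.10)
```

Any row flagged as a mismatch is scored differently depending on the size of the frame passed to h2o.predict, which narrows the report down to specific rows.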
Source: https://stackoverflow.com/questions/47404817/glm-model-h2o-predict-gives-very-different-results-depending-on-number-of-rows