Question
When predicting on a test set that does not contain the response variable, h2o fails in several different ways if one-hot encoding was used for a factor variable during training, whether implicitly (as when training a GLM) or explicitly (via categorical_encoding in other methods).
This error is present in R 3.4.0 and h2o 3.12.0.1. We have also tested with h2o 3.10.3.3.
library(h2o)
localH2O <- h2o.init()

# Load the prostate data shipped with the h2o package
prostatePath <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- read.csv(prostatePath)

# Add a factor with one distinct level per row (Q1 ... Q380)
prostate.hex$our_factor <- as.factor(paste0("Q", rep(1:380, 1)))
prostate.hex <- as.h2o(prostate.hex)
prostate.hex$weight <- 1  # constant column, used as the offset below

prostate_train <- prostate.hex[1:300, ]
prostate_test <- prostate.hex[301:380, ]
prostate_test <- prostate_test[, -3]  # delete response variable (AGE) from test data

# Example 1: GLM with an offset column
model <- h2o.glm(y = "AGE", x = c("our_factor"),
                 training_frame = prostate_train, offset_column = "weight")
predict(model, newdata = prostate_train)
predict(model, newdata = prostate_test)

# Example 2: GLM without an offset column
model <- h2o.glm(y = "AGE", x = c("our_factor"),
                 training_frame = prostate_train)
predict(model, newdata = prostate_train)
predict(model, newdata = prostate_test)

# Example 3: GBM with explicit one-hot encoding
model <- h2o.gbm(y = "AGE", x = c("our_factor"),
                 training_frame = prostate_train, categorical_encoding = "OneHotExplicit")
predict(model, newdata = prostate_train)
predict(model, newdata = prostate_test)
The first GLM example, which was trained with an offset column, produces all NaNs when predicting on the test data. The second GLM example produces this error:
DistributedException from localhost/127.0.0.1:54321: '0', caused by java.lang.ArrayIndexOutOfBoundsException: 0
DistributedException from localhost/127.0.0.1:54321: '0', caused by java.lang.ArrayIndexOutOfBoundsException: 0
at water.MRTask.getResult(MRTask.java:478)
at water.MRTask.getResult(MRTask.java:486)
at water.MRTask.doAll(MRTask.java:390)
at water.MRTask.doAll(MRTask.java:396)
at hex.glm.GLMModel.predictScoreImpl(GLMModel.java:1215)
at hex.Model.score(Model.java:1077)
at water.api.ModelMetricsHandler$1.compute2(ModelMetricsHandler.java:351)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at hex.DataInfo.extractDenseRow(DataInfo.java:1025)
at hex.glm.GLMScore.map(GLMScore.java:148)
at water.MRTask.compute2(MRTask.java:657)
at water.H2O$H2OCountedCompleter.compute1(H2O.java:1352)
at hex.glm.GLMScore$Icer.compute1(GLMScore$Icer.java)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1348)
... 5 more
Error: DistributedException from localhost/127.0.0.1:54321: '0', caused by java.lang.ArrayIndexOutOfBoundsException: 0
The GBM example produces this error (even though the only column missing from the test data is the response variable):
java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
at hex.Model.adaptTestForTrain(Model.java:1028)
at hex.Model.adaptTestForTrain(Model.java:854)
at hex.Model.score(Model.java:1072)
at water.api.ModelMetricsHandler$1.compute2(ModelMetricsHandler.java:351)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Error: java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
The error seems to be specific to factor variables and to using one-hot encoding explicitly. It can be worked around by adding a 'fake' response column to the test dataset (we've tested this and, as we'd expect, the value of this column makes no difference to the predictions), but that's obviously not ideal.
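For reference, the workaround can be sketched as follows for the GBM case. This is a minimal sketch that rebuilds the setup from the example above; the placeholder value 0 and the variable name predictions are arbitrary choices made for illustration, and whether scoring succeeds may depend on the h2o version.

```r
library(h2o)
h2o.init()

# Rebuild the data from the example: one factor level per row (Q1 ... Q380)
prostate <- read.csv(system.file("extdata", "prostate.csv", package = "h2o"))
prostate$our_factor <- as.factor(paste0("Q", 1:380))
prostate.hex <- as.h2o(prostate)

prostate_train <- prostate.hex[1:300, ]
prostate_test <- prostate.hex[301:380, ]
prostate_test <- prostate_test[, -3]  # drop the response column (AGE)

model <- h2o.gbm(y = "AGE", x = "our_factor",
                 training_frame = prostate_train,
                 categorical_encoding = "OneHotExplicit")

# Workaround: add a placeholder response column before scoring.
# The value 0 is an arbitrary assumption; as reported above, the value
# did not change the predictions in our tests.
prostate_test$AGE <- 0
predictions <- predict(model, newdata = prostate_test)
```

The placeholder only restores the column that scoring appears to look for; it carries no information and should not be used to compute metrics.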
The errors remain even when every factor level is present in both the training and test sets, as long as there are 5 or more factor levels:
prostate.hex$our_factor<-as.factor(paste0("Q",c(rep(c(1:5),76))))
If there are 4 or fewer, there are no problems with the GLM, but the error message from the GBM remains.
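To check where the GLM behaviour changes, a small probe along these lines may help. It is only a sketch: the helper glm_scores_without_response is a hypothetical name introduced here, the range 2:6 is an arbitrary choice, and the probe only detects hard errors from predict, not predictions that come back as NaN.

```r
library(h2o)
h2o.init()

prostate <- read.csv(system.file("extdata", "prostate.csv", package = "h2o"))

# Hypothetical helper: train a GLM on a factor with n_levels levels and
# report whether scoring a test frame without the response raises an error.
glm_scores_without_response <- function(n_levels) {
  df <- prostate
  # Cycle through the levels so each one appears in both train and test
  df$our_factor <- as.factor(paste0("Q", rep_len(1:n_levels, nrow(df))))
  hex <- as.h2o(df)
  train <- hex[1:300, ]
  test <- hex[301:380, ]
  test <- test[, -3]  # drop the response column (AGE)
  model <- h2o.glm(y = "AGE", x = "our_factor", training_frame = train)
  tryCatch({
    predict(model, newdata = test)
    TRUE   # scoring completed without an error
  }, error = function(e) FALSE)
}

# TRUE means predict() ran without raising an error for that level count
results <- sapply(2:6, glm_scores_without_response)
names(results) <- 2:6
print(results)
```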
Source: https://stackoverflow.com/questions/44901421/h2o-predictions-sometimes-fail-when-response-variable-not-present-in-test-set