h2o predictions sometimes fail when response variable not present in test set

Submitted by 百般思念 on 2020-01-04 05:21:07

Question


When predicting on a test set where the response variable is not present, h2o fails in several different ways if one-hot encoding was used for a factor variable during training, whether it is applied implicitly (as when training a GLM) or specified explicitly (as in other methods).

This error is present in R 3.4.0 and h2o 3.12.0.1. We have also tested with h2o 3.10.3.3.

library(h2o)
localH2O = h2o.init()

# Load the prostate data shipped with the h2o package and add a factor
# column with one distinct level per row (380 levels)
prostatePath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = read.csv(prostatePath)
prostate.hex$our_factor <- as.factor(paste0("Q", c(rep(c(1:380), 1))))

# Convert to an H2OFrame and add a constant column, used as the offset below
prostate.hex <- as.h2o(prostate.hex)
prostate.hex$weight <- 1

# Split into train and test sets
prostate_train <- prostate.hex[1:300, ]
prostate_test <- prostate.hex[301:380, ]
prostate_test <- prostate_test[, -3] # delete response variable (AGE) from test data

# Case 1: GLM with an offset column
model <- h2o.glm(y = "AGE", x = c("our_factor"),
                 training_frame = prostate_train, offset_column = "weight")
predict(model, newdata = prostate_train)
predict(model, newdata = prostate_test)

# Case 2: GLM without an offset column
model <- h2o.glm(y = "AGE", x = c("our_factor"),
                 training_frame = prostate_train)
predict(model, newdata = prostate_train)
predict(model, newdata = prostate_test)

# Case 3: GBM with explicit one-hot encoding
model <- h2o.gbm(y = "AGE", x = c("our_factor"),
                 training_frame = prostate_train, categorical_encoding = "OneHotExplicit")
predict(model, newdata = prostate_train)
predict(model, newdata = prostate_test)

The first GLM example, which was trained with an offset column, produces all NaNs when predicting on the test data. The second GLM example produces this error:

DistributedException from localhost/127.0.0.1:54321: '0', caused by java.lang.ArrayIndexOutOfBoundsException: 0

DistributedException from localhost/127.0.0.1:54321: '0', caused by java.lang.ArrayIndexOutOfBoundsException: 0
    at water.MRTask.getResult(MRTask.java:478)
    at water.MRTask.getResult(MRTask.java:486)
    at water.MRTask.doAll(MRTask.java:390)
    at water.MRTask.doAll(MRTask.java:396)
    at hex.glm.GLMModel.predictScoreImpl(GLMModel.java:1215)
    at hex.Model.score(Model.java:1077)
    at water.api.ModelMetricsHandler$1.compute2(ModelMetricsHandler.java:351)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
    at hex.DataInfo.extractDenseRow(DataInfo.java:1025)
    at hex.glm.GLMScore.map(GLMScore.java:148)
    at water.MRTask.compute2(MRTask.java:657)
    at water.H2O$H2OCountedCompleter.compute1(H2O.java:1352)
    at hex.glm.GLMScore$Icer.compute1(GLMScore$Icer.java)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1348)
    ... 5 more

Error: DistributedException from localhost/127.0.0.1:54321: '0', caused by java.lang.ArrayIndexOutOfBoundsException: 0

The GBM example produces this error (even though the only column missing from the test data is the response variable):

java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
    at hex.Model.adaptTestForTrain(Model.java:1028)
    at hex.Model.adaptTestForTrain(Model.java:854)
    at hex.Model.score(Model.java:1072)
    at water.api.ModelMetricsHandler$1.compute2(ModelMetricsHandler.java:351)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Error: java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set

The error seems to be specific to factor variables and the explicit use of one-hot encoding. It can be worked around by adding a 'fake' response column to the test dataset (we've tested this, and the value of this column makes no difference to the predictions, as we'd expect), but that's obviously not ideal.
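
For reference, the workaround can be sketched roughly as follows. This is only an illustration: it assumes the `model` and `prostate_test` objects from the script above are still in the session, and the placeholder values 0 and 999 are arbitrary choices of ours.

```r
# Workaround sketch: add a placeholder ("fake") response column to the test
# frame before predicting. Assumes `model` and `prostate_test` from the
# script above; the values 0 and 999 are arbitrary.
test_a <- prostate_test
test_a$AGE <- 0      # same column-assignment pattern as prostate.hex$weight <- 1
pred_a <- predict(model, newdata = test_a)

test_b <- prostate_test
test_b$AGE <- 999    # a different placeholder value
pred_b <- predict(model, newdata = test_b)

# Pull both prediction frames into R and compare them; if the placeholder
# really is ignored, the two sets of predictions should match.
all.equal(as.data.frame(pred_a), as.data.frame(pred_b))
```

The comparison at the end is just a way to check the claim that the value of the fake column does not affect the predictions.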

The errors remain even if all the factor levels are present in both the train and test set, if there are 5 or more factor levels:

prostate.hex$our_factor<-as.factor(paste0("Q",c(rep(c(1:5),76))))

If there are 4 or fewer, there are no problems with the GLM, but the error message from the GBM remains.

Source: https://stackoverflow.com/questions/44901421/h2o-predictions-sometimes-fail-when-response-variable-not-present-in-test-set
