Question
I am using H2O on a large dataset with 8 million rows and 10 columns. I trained a random forest with h2o.randomForest. Training went fine and prediction also worked correctly. Now I would like to convert my predictions to a data.frame. I did this:
A2=h2o.predict(m1,Tr15_h2o)
pred2=as.data.frame(A2)
but it is too slow and takes forever. Is there a faster way to convert an H2OFrame to a data.frame or data.table?
Answer 1:
Here is some code that demonstrates how to use the data.table package as the backend for the conversion, along with some benchmarks from my MacBook:
library(h2o)
h2o.init(nthreads = -1, max_mem_size = "16G")
hf <- h2o.createFrame(rows = 10000000)
options("h2o.use.data.table"=FALSE) #no data.table
system.time(df <- as.data.frame(hf))
# user system elapsed
# 224.387 13.274 272.252
options("datatable.verbose"=TRUE)
options("h2o.use.data.table"=TRUE) # use data.table
system.time(df2 <- as.data.frame(hf))
# user system elapsed
# 50.686 4.020 82.946
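In this benchmark, switching to the data.table backend cut the elapsed time from about 272 seconds to about 83 seconds, roughly a 3.3x speedup. You can get more detailed information when using data.table if you turn on this option: options("datatable.verbose"=TRUE).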
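Applied to the objects from the question, the same option might be used as in the sketch below. It assumes that m1 and Tr15_h2o already exist in a running H2O session and that the data.table package is installed. The final setDT call is an optional extra that is not part of the original answer, for readers who want a data.table rather than a data.frame.

```r
library(h2o)
library(data.table)  # assumed to be installed

# Ask h2o to use data.table for the H2OFrame -> R conversion
options("h2o.use.data.table" = TRUE)

# m1 and Tr15_h2o are the model and the H2OFrame from the question
A2 <- h2o.predict(m1, Tr15_h2o)
pred2 <- as.data.frame(A2)

# Optional: turn the result into a data.table by reference (no copy is made)
setDT(pred2)
```

Wrapping the as.data.frame call in system.time, as in the benchmark above, is a simple way to check whether the option helps on a given machine.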
Answer 2:
We have seen this issue with large prediction datasets: exporting the predictions to a data frame, or converting them to other types, takes a long time. I have opened the following JIRA ticket to track it:
https://0xdata.atlassian.net/browse/PUBDEV-4166
Answer 3:
Yes, there are some new options you can turn on to use data.table::fread and speed it up. Type h2o:::as.data.frame.H2OFrame to see the small amount of R source code containing the options, or check the H2O release notes. Please also try the latest fread from dev, which is now parallel as of yesterday.
Once users have reported success, we can turn the option on by default.
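The suggestion above can be followed without starting a cluster. This is a minimal sketch, assuming both h2o and data.table are installed. The options that appear in the printed source depend on the installed h2o version.

```r
library(h2o)

# Print the source of the S3 method to see which options() it consults
print(h2o:::as.data.frame.H2OFrame)

# Check the installed versions of both packages
packageVersion("h2o")
packageVersion("data.table")

# Current value of the option shown in the first answer
# (NULL means it has not been set in this session)
getOption("h2o.use.data.table")
```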
Source: https://stackoverflow.com/questions/42865609/how-to-convert-my-h2o-prediction-to-a-data-frame-in-a-fast-way