How to get data into h2o fast

Submitted by 自闭症网瘾萝莉.ら on 2019-12-04 05:11:58

Think of as.h2o() as a convenience function that does these steps:

  1. converts your R data to a data.frame, if not already one.
  2. saves that data.frame to a temp file on local disk (it will use data.table::fwrite() if available (*), otherwise write.csv())
  3. calls h2o.uploadFile() on that temp file
  4. deletes the temp file
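
The four steps above can be sketched by hand as follows. This is only an approximation of what as.h2o() does, not its actual source; it assumes the h2o package is installed and that h2o.init() can start or reach a cluster. The built-in iris data set and the frame name "iris_manual" are stand-ins chosen for illustration.

```r
library(h2o)
h2o.init()  # assumption: starts or connects to a local H2O cluster

# Step 1: make sure the data is a data.frame
df <- as.data.frame(iris)

# Step 2: write it to a temp file, preferring data.table::fwrite() if installed
tmp <- tempfile(fileext = ".csv")
if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::fwrite(df, tmp)
} else {
  write.csv(df, tmp, row.names = FALSE)
}

# Step 3: push the file from the client to the cluster
hf <- h2o.uploadFile(tmp, destination_frame = "iris_manual")

# Step 4: remove the temp file
unlink(tmp)
```

Writing it out like this makes the two costs visible: the disk write in step 2 and the client-to-cluster upload in step 3.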

As your updates say, writing huge data files to disk can take a while. But the other pain point here is using h2o.uploadFile() instead of the quicker h2o.importFile(). The choice between them comes down to visibility:

  • With h2o.uploadFile() your client has to be able to see the file.
  • With h2o.importFile() your cluster has to be able to see the file.

When your client is running on the same machine as one of your cluster nodes, your data file is visible to both client and cluster, so always prefer h2o.importFile(). (It does a multi-threaded import.)
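
A minimal sketch of the h2o.importFile() route, assuming the client and the cluster run on the same machine so both can see the same path. The small file written here is only a stand-in to keep the example self-contained; in practice you would point `path` at your existing data file. The frame name "example_import" is made up for illustration.

```r
library(h2o)
h2o.init()  # assumption: cluster runs on this machine

# Stand-in for a large file that is already on disk
path <- file.path(tempdir(), "example_import.csv")
write.csv(iris, path, row.names = FALSE)

# The cluster reads the file directly; nothing is routed through the R client
hf <- h2o.importFile(path, destination_frame = "example_import")
dim(hf)
```

If the cluster runs on other machines, `path` must be a location those nodes can read, not a path that exists only on the client.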

Two more tips: only bring into the R session the data you actually need there. And remember that both R and H2O are column-oriented, so cbind can be quick. If you only need to process 100 of your 2300 columns in R, keep those in one csv file and the other 2200 columns in another csv file. Then h2o.cbind() them after loading each into H2O.
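
The column-split idea might look like the sketch below. The file names and the 2/3 column split of iris are hypothetical stand-ins for the 100/2200 split described above, and it assumes both files have the same number of rows in the same order.

```r
library(h2o)
h2o.init()  # assumption: cluster runs on this machine

# Stand-ins for the two pre-split csv files
part_r    <- file.path(tempdir(), "cols_for_r.csv")
part_rest <- file.path(tempdir(), "cols_other.csv")
write.csv(iris[, 1:2], part_r, row.names = FALSE)
write.csv(iris[, 3:5], part_rest, row.names = FALSE)

# Load each part into H2O separately
small <- h2o.importFile(part_r)
rest  <- h2o.importFile(part_rest)

# Only the small part is ever pulled into the R session
small_df <- as.data.frame(small)

# Join the columns back together inside H2O
full <- h2o.cbind(small, rest)
```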

*: Use h2o:::as.h2o.data.frame (without parentheses) to see the actual code. For data.table writing you need to first do options(h2o.use.data.table = TRUE); you can also optionally switch it on/off with the h2o.fwrite option.
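
As a short sketch, the settings from this footnote would be applied like this before calling as.h2o(). The option names are the ones given above; setting h2o.fwrite to FALSE is shown only to illustrate the on/off switch.

```r
library(h2o)

# Allow h2o to use data.table (and therefore fwrite) when it is installed
options(h2o.use.data.table = TRUE)

# Optional: turn the fwrite path off again without unsetting the option above
options(h2o.fwrite = FALSE)

# Print the function body to check which write path your installed version takes
h2o:::as.h2o.data.frame
```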
