I am working with a 10 GB training data frame. I use the H2O library for faster computation. Each time I load the dataset, I have to convert the data frame into an H2O object.
as.h2o(d) works like this (even when client and server are the same machine):
1. It writes d to a csv file in a temp location.
2. It calls h2o.uploadFile(), which does an HTTP POST to the server, then a single-threaded import.
Instead, prepare your data in advance somewhere(*), then use h2o.importFile()
(See http://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/h2o.importFile.html). This saves messing around with the local file, and it can also do a parallelized read and import.
*: For speediest results, the "somewhere" should be as close to the server as possible. For it to work at all, the "somewhere" has to be somewhere the server can see. If client and server are the same machine, then that is automatic. At the other extreme, if your server is a cluster of machines in an AWS data centre on another continent, then putting the data into S3 works well. You can also put it on HDFS, or on a web server.
See http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/importing-data.html for some examples in both R and Python.
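For the simplest case, where client and server are the same machine, the approach above might look roughly like the sketch below. The data frame d here is a tiny stand-in for the real 10 GB frame, and the file location (train.csv in the home directory) is a hypothetical choice, not something the question specifies. For data of that size, a faster CSV writer than base write.csv would probably be worth considering.

```r
library(h2o)

# Start (or connect to) a local H2O server.
h2o.init()

# Stand-in for the real training data frame (assumption for illustration).
d <- data.frame(x = c(1.5, 2.5, 3.5), y = c("a", "b", "a"))

# One-off preparation: write the data to a location the server can see.
# The path is hypothetical; an absolute path avoids depending on the
# server's working directory.
csv_path <- file.path(path.expand("~"), "train.csv")
write.csv(d, csv_path, row.names = FALSE)

# In every later session, skip as.h2o() and let the server read the file
# directly. The server does the read and parse itself.
train <- h2o.importFile(path = csv_path, header = TRUE)
```

The write step only needs to happen once; later sessions can go straight to h2o.importFile() with the same path.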