Question
I have a pandas DataFrame and I need to convert it to an H2O frame. I use the following code.
Code:
import time
import h2o

# Convert pandas dataframe to H2O frame (input_df and logger are defined earlier)
start_time = time.time()
input_data_matrix = h2o.H2OFrame(input_df)
logger.debug("3. Time taken to convert H2O Frame- " + str(time.time() - start_time))
Output:
2019-02-05 04:38:55,238 logger DEBUG 3. Time taken to convert H2O Frame- 9320.119945764542
The data frame (i.e. input_df) is 183K x 435 in size, with no null or NaN values.
The conversion takes more than 2 hours (about 9,320 seconds). Is there a better way to perform this operation?
Answer 1:
Save the pandas data frame to a csv file. (Skip this step if you loaded it from a csv file in the first place, and haven't done any data munging on it, of course.)
Put that csv file somewhere the h2o server can see it. (If you are running client and server on the same machine, this is already the case.)
Use h2o.import_file() (in preference to h2o.upload_file() or h2o.H2OFrame()).
h2o.import_file() is the quickest way to get data into H2O, but the file must be visible to the server. When dealing with a remote cluster, this might mean uploading it to that server's file system, or putting it on a web server, an HDFS cluster, AWS S3, and so on.
(The reason h2o.upload_file() is slower is that it will do an HTTP POST of the data, from client to server, and h2o.H2OFrame() is slower because it exports your pandas data to a temp csv file, then uses h2o.upload_file(), then deletes the temp file afterwards.)
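The three steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the client and the H2O server run on the same machine, so a local path is visible to the server. The helper names write_frame_to_csv and pandas_to_h2o_via_csv are hypothetical and used only for illustration.

```python
import os

import pandas as pd


def write_frame_to_csv(df: pd.DataFrame, csv_path: str) -> str:
    """Write df to csv_path without the index column and return the absolute path."""
    df.to_csv(csv_path, index=False)
    return os.path.abspath(csv_path)


def pandas_to_h2o_via_csv(df: pd.DataFrame, csv_path: str):
    """Save df as a csv file, then have the H2O server read that file itself.

    Assumption: csv_path is on a file system the H2O server can see
    (true when client and server share a machine).
    """
    # Imported here so the csv step works even where h2o is not installed.
    import h2o

    server_visible_path = write_frame_to_csv(df, csv_path)
    # import_file makes the server load the file directly, so the data is
    # not pushed from the client over HTTP.
    return h2o.import_file(path=server_visible_path)
```

With a running cluster (for example after h2o.init()), calling pandas_to_h2o_via_csv(input_df, "/tmp/input_df.csv") would stand in for h2o.H2OFrame(input_df); the path here is only a placeholder. For a remote cluster the csv would first have to be copied somewhere the server can read it, and that location passed to h2o.import_file() instead.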
Source: https://stackoverflow.com/questions/54541358/is-there-efficient-way-to-convert-pandas-dataframe-to-h2o-frame