Question
How is it possible that storing data into an H2O frame is slower than storing it into a data.table?
# Packages used: "h2o" and "data.table"
library(h2o)
library(data.table)
# Start (or connect to) the H2O cluster before creating any H2O frame
h2o.init(nthreads=-1)
# Create the matrices
matrix1<-data.table(matrix(rnorm(1000*1000),ncol=1000,nrow=1000))
matrix2<-h2o.createFrame(rows=1000,cols=1000)
#Data.table variable store
for(i in 1:1000){
matrix1[i,1]<-3
}
#H2O Matrix Frame store
for(i in 1:1000){
matrix2[i,1]<-3
}
Thanks!
Answer 1:
H2O uses a client/server architecture. (See http://docs.h2o.ai/h2o/latest-stable/h2o-docs/architecture.html)
So what you've shown is a very inefficient way to fill an H2O frame in H2O memory. Every write turns into a network call from the R client to the H2O server. You almost certainly don't want this.
For your example, since the data isn't large, a reasonable approach is to do the initial assignment on a local data frame (or data.table) and then push the result to H2O in one step with as.h2o().
h2o_frame = as.h2o(matrix1)
head(h2o_frame)
This pushes an R data frame from the R client into an H2O frame in H2O server memory. (And you can do as.data.table() to do the opposite.)
data.table Tips:
For data.table, prefer the in-place := syntax. This avoids copies. So, for example:
matrix1[i, 3 := 42]
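The loop from the question can also be collapsed into a single by-reference assignment. A minimal sketch, assuming the default column names V1, V2, ... that data.table gives a converted matrix; the name dt is only illustrative:

```r
library(data.table)

# Same shape as matrix1 in the question
dt <- data.table(matrix(rnorm(1000 * 1000), ncol = 1000, nrow = 1000))

# One by-reference assignment covering rows 1..1000 of the first column,
# instead of 1000 separate single-cell writes
dt[1:1000, V1 := 3]

# Equivalent call with set(), addressing the column by position
set(dt, i = 1:1000, j = 1L, value = 3)
```

Both forms modify dt in place, so no copy of the table is made.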
H2O Tips:
The fastest way to read data into H2O is by ingesting it using the pull method in h2o.importFile(). This is parallel and distributed.
The as.h2o() trick shown above works well for small datasets that easily fit in the memory of one host.
If you want to watch the network messages between R and H2O, call h2o.startLogging().
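As a rough sketch of the pull route, the data can be prepared locally, written to a file, and then read by the H2O server itself. This assumes a local H2O instance started by h2o.init() that can see the same filesystem; the names local_dt and csv_path and the temporary file location are illustrative, not part of the original question:

```r
library(h2o)
library(data.table)

h2o.init(nthreads = -1)

# Build and modify the data locally; nothing is sent to H2O yet
local_dt <- data.table(matrix(rnorm(1000 * 1000), ncol = 1000, nrow = 1000))
set(local_dt, i = 1:1000, j = 1L, value = 3)

# Write the table to disk (hypothetical location), then let the server ingest it
csv_path <- file.path(tempdir(), "matrix1.csv")
fwrite(local_dt, csv_path)
h2o_frame <- h2o.importFile(path = csv_path)

dim(h2o_frame)
```

For a table this small as.h2o() is simpler; the file-based route is the one that scales to data too large to push from a single R session.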
Answer 2:
I can't answer your question directly because I don't know h2o. However, I can make a guess.
Your code to fill the data.table is slow because of "copy-on-modify" semantics. If you update your table by reference instead, the code becomes dramatically faster.
# Original loop: each assignment goes through copy-on-modify
for(i in 1:1000){
matrix1[i,1]<-3
}
# Same update by reference with set()
for(i in 1:1000){
set(matrix1, i, 1L, 3)
}
With set(), my loop takes 3 milliseconds, while your loop takes 18 seconds (6000 times longer).
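A comparison like this can be reproduced with system.time(). The sketch below uses a smaller table (n = 200, an assumption made only to keep the run short), so the absolute timings will differ from the figures quoted above; dt_copy and dt_ref are illustrative names:

```r
library(data.table)

n <- 200  # smaller than the 1000 x 1000 case in the question
dt_copy <- data.table(matrix(rnorm(n * n), ncol = n, nrow = n))
dt_ref  <- copy(dt_copy)

# Cell-by-cell assignment with `[<-`
t_assign <- system.time(
  for (i in 1:n) dt_copy[i, 1] <- 3
)

# Cell-by-cell assignment by reference with set()
t_set <- system.time(
  for (i in 1:n) set(dt_ref, i, 1L, 3)
)

print(t_assign["elapsed"])
print(t_set["elapsed"])

# Both loops should leave the first column in the same state
same_result <- identical(dt_copy$V1, dt_ref$V1)
```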
I suppose h2o works the same way, but with some extra work because an H2O frame is a special object. Maybe each write involves message passing to the H2O cluster?
Source: https://stackoverflow.com/questions/45783048/h2o-running-slower-than-data-table-r