H2O running slower than data.table R

Posted by 人盡茶涼 on 2020-01-04 05:31:29

Question


How is it possible that storing data into an H2O frame is slower than storing it into a data.table?

# Packages used: "h2o" and "data.table"
library(h2o)
library(data.table)

# Start (or connect to) the H2O cluster before creating any H2O frame
h2o.init(nthreads = -1)

# Create the matrices
matrix1 <- data.table(matrix(rnorm(1000 * 1000), ncol = 1000, nrow = 1000))
matrix2 <- h2o.createFrame(rows = 1000, cols = 1000)

# data.table store, one cell per iteration
for (i in 1:1000) {
  matrix1[i, 1] <- 3
}

# H2O frame store, one cell per iteration
for (i in 1:1000) {
  matrix2[i, 1] <- 3
}

Thanks!


Answer 1:


H2O is a client/server architecture. (See http://docs.h2o.ai/h2o/latest-stable/h2o-docs/architecture.html)

So what you've shown is a very inefficient way to populate an H2O frame in H2O memory. Every write turns into a network call from the R client to the H2O server. You almost certainly don't want this.

For your example, since the data isn't large, a reasonable approach is to do the initial assignment on a local data frame (or data.table) and then push the result to H2O in one step with as.h2o().

h2o_frame = as.h2o(matrix1)
head(h2o_frame)

This pushes an R data frame from the R client into an H2O frame in H2O server memory. (And you can do as.data.table() to do the opposite.)


data.table Tips:

For data.table, prefer the in-place := syntax. This avoids copies. So, for example:

matrix1[i, 3 := 42]
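
As a minimal sketch of the := tip (the table name dt and its size are made up for illustration, not taken from the question), the whole loop can usually be replaced by a single vectorized assignment by reference:

```r
library(data.table)

# Illustrative table; as.data.table() names the columns V1, V2, ...
dt <- as.data.table(matrix(rnorm(1000 * 5), ncol = 5))

# One assignment by reference: rows 1..1000 of column V1 become 3.
dt[1:1000, V1 := 3]

# The same update, addressing the column by position instead of by name.
dt[1:1000, (1L) := 3]
```

Because the table is modified in place, there is no need to assign the result back to dt.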

H2O Tips:

The fastest way to read data into H2O is to ingest it with h2o.importFile(), which uses the pull method: the H2O server reads the file itself. The load is parallel and distributed.

The as.h2o() trick shown above works well for small datasets that easily fit in memory of one host.
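
A rough sketch of the pull route, assuming a local H2O cluster started from the same machine so that the server can see the temporary file (the names local_dt and csv_path are illustrative only):

```r
library(h2o)
library(data.table)

h2o.init(nthreads = -1)

# Do all cell-level writes locally, where they are cheap.
local_dt <- as.data.table(matrix(rnorm(1000 * 1000), ncol = 1000))
set(local_dt, j = 1L, value = 3)   # whole first column set to 3, by reference

# Write the table once, then let the H2O server pull the file in.
csv_path <- tempfile(fileext = ".csv")
fwrite(local_dt, csv_path)
h2o_frame <- h2o.importFile(path = csv_path)

dim(h2o_frame)
```

With a remote or multi-node cluster the path has to be readable by the H2O nodes themselves, so a temporary file on the R client would not be enough.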

If you want to watch the network messages between R and H2O, call h2o.startLogging().




Answer 2:


I can't answer your question because I don't know H2O. However, I can make a guess.

Your code to fill the data.table is slow because of R's "copy-on-modify" semantics. If you update your table by reference, you will speed up your code dramatically.

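# Original loop: each assignment goes through copy-on-modify
for (i in 1:1000) {
  matrix1[i, 1] <- 3
}

# Same update by reference with set()
for (i in 1:1000) {
  set(matrix1, i, 1L, 3)
}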

With set(), my loop takes 3 milliseconds, while your loop takes 18 seconds (6,000 times longer).
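
To reproduce this kind of comparison, something along these lines should work. It is a sketch: the table is shrunk to 200 x 200 so that it runs quickly, the names dt_copy and dt_ref are illustrative, and the timings will differ by machine and data.table version.

```r
library(data.table)

n <- 200L
dt_copy <- as.data.table(matrix(rnorm(n * n), ncol = n))
dt_ref  <- copy(dt_copy)   # independent copy, so both loops start from the same data

# Copy-on-modify style assignment, as in the question.
t_copy <- system.time({
  for (i in seq_len(n)) dt_copy[i, 1] <- 3
})

# Update by reference with set().
t_ref <- system.time({
  for (i in seq_len(n)) set(dt_ref, i, 1L, 3)
})

print(t_copy[["elapsed"]])
print(t_ref[["elapsed"]])

# Both loops should leave the first column in the same state.
stopifnot(identical(dt_copy$V1, dt_ref$V1))
```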

I suppose H2O works the same way, with some extra work on top because an H2O frame is a special object. Maybe some message passing to the H2O cluster?



Source: https://stackoverflow.com/questions/45783048/h2o-running-slower-than-data-table-r
