Question
I have created a data frame and want to convert it to an H2O Frame.
To do that, I run:
library(data.table)  # provides data.table()
library(h2o)
h2o.init(nthreads = -1)
df <- data.table(matrix(0, ncol = 46, nrow = 30000))
df <- as.h2o(df)
When I run htop on the command line, I see that only one of the 4 available processors is working. Is there a way to make this use all of them?
Thanks!
Answer 1:
There are two factors at work here.
1) The first is that you are using as.h2o(), which is the not-very-efficient "push" method (where the client pushes data to the server) of ingesting data.
This is meant for small data and for convenience (which is fine for this case, because you created a dataset with 30,000 rows, which is small data).
If you want H2O to ingest data efficiently, you need to use the "pull" method, where H2O pulls data from the data store into H2O's memory. In R, this would be h2o.importFile().
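As a rough sketch of the "pull" route for the dataset in the question: write the data out to a CSV file and let H2O parse it. This assumes the H2O server runs on the same machine as the R session (as it does when h2o.init() starts a local instance), so the file path is readable by the server. The temporary file path and the variable names csv_path and hf are only illustrative.

```r
library(data.table)
library(h2o)

h2o.init(nthreads = -1)  # -1 asks H2O to use all available cores

# Same shape as the dataset in the question: 30,000 rows x 46 columns of zeros.
df <- data.table(matrix(0, ncol = 46, nrow = 30000))

# Write the data to disk. The temp-file location is an arbitrary choice;
# any path that the H2O server can read would do.
csv_path <- tempfile(fileext = ".csv")
fwrite(df, csv_path)

# "Pull": the H2O server reads and parses the file itself,
# instead of the R client pushing the data over as as.h2o() does.
hf <- h2o.importFile(path = csv_path)
dim(hf)
```

For a file that lives only on the client machine while the cluster runs elsewhere, this path would not be visible to the server, so the sketch applies as written only to the local setup.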
2) The second factor is H2O uses chunking of data (contiguous rows in the dataset) to get data parallelism. The number of chunks per column directly affects the number of threads that work in parallel. Once a dataset is read in, if it only has 1 chunk per column, then it will only be able to use 1 thread (and hence 1 core). You can see the number of chunks per column by looking at how the data was parsed in the H2O Flow Web UI.
I ran your program above; the Frame Distribution Summary for the resulting H2O Frame shows that the number of chunks per column is 1.
Running the same program again with 3,000,000 rows gives 66 chunks per column.
This is much better, because once you start working with the data in H2O (for example, training a model), you will get up to 66 threads running in parallel on a distributed cluster.
[ Note for the bigger case, the data ingestion itself took a few minutes on my laptop and was still slow and single-threaded because it's using the inefficient as.h2o() "push" approach. If you wrote the dataset out to a csv file, and had H2O parse it with the h2o.importFile() "pull" approach, it would be much faster. ]
Source: https://stackoverflow.com/questions/45819380/h2o-not-working-on-parallel