Parallel processing in R with H2O

粉色の甜心 · 2021-02-10 05:22

I am setting up a piece of code to parallel-process some computations for N groups in my data using foreach.

I have a computation that involves a call to h2o.gbm() for each group.

2 Answers
  •  渐次进展 · 2021-02-10 06:00

    The way your code is currently set up won't be the best option. I understand what you are trying to do -- execute a bunch of GBMs in parallel (each on a single-core H2O cluster), so you can maximize the CPU usage across the 12 cores on your machine. However, what your code will actually do is try to run all the GBMs in your foreach loop in parallel on the same single-core H2O cluster. A single R instance can only be connected to one H2O cluster at a time, and each foreach worker is its own R instance.
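    You can check which cluster a given R session is attached to with h2o.clusterInfo(), which prints the connection details (version, nodes, memory) of the session's current cluster:

    library(h2o)
    h2o.init(nthreads = 1)  # starts a new cluster, or connects to one already running on the default port
    h2o.clusterInfo()       # prints the cluster this R session is connected to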

    Unlike most machine learning algorithms in R, the H2O algorithms are all multi-core enabled, so training is already parallelized at the algorithm level, without the need for a parallel R package like foreach.

    You have a few options (#1 or #3 is probably best):

    1. Set h2o.init(nthreads = -1) at the top of your script to use all 12 of your cores. Change the foreach() loop to a regular loop and train each GBM (on a different data partition) sequentially. Although the different GBMs are trained sequentially, each single GBM will be fully parallelized across the H2O cluster (see the sketch after this list).
    2. Set h2o.init(nthreads = -1) at the top of your script, but keep your foreach() loop. This should run all your GBMs at once, with each GBM parallelized across all cores. This could overwhelm the H2O cluster a bit (this is not really how H2O is meant to be used) and could be a bit slower than #1, but it's hard to say without knowing the size of your data and the number of partitions you want to train on. If you are already using 70% of your RAM for a single GBM, then this might not be the best option.
    3. You can update your code to do the following (which most closely resembles your original script). This preserves your foreach loop, but starts a new 1-core H2O cluster on a different port of your machine for each worker. See below.
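
    For reference, here is a minimal sketch of option #1; the random row chunks of iris below are a hypothetical stand-in for your N data partitions:

    library(h2o)

    # One multi-core cluster for the whole script
    h2o.init(nthreads = -1)

    data(iris)
    # Hypothetical partitioning: three random row chunks standing in for your N groups
    set.seed(1)
    chunk <- sample(rep(1:3, length.out = nrow(iris)))
    partitions <- split(iris, chunk)

    # Train sequentially; each GBM is still parallelized across every core of the cluster
    models <- lapply(partitions, function(part) {
      h2o.gbm(x = 1:4, y = "Species", training_frame = as.h2o(part))
    })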

    Updated R code example which uses the iris dataset and returns the predicted class for iris as a data.frame:

    library(foreach)
    library(doParallel)
    library(h2o)
    # Shut down any H2O cluster left over from a previous session
    try(h2o.shutdown(prompt = FALSE), silent = TRUE)

    # Set up a parallel backend with 12 worker processes
    cl <- makeCluster(12)
    registerDoParallel(cl)

    # Each iteration starts its own 1-core H2O cluster on a unique port
    df4 <- foreach(i = seq(20), .combine = rbind) %dopar% {
      library(h2o)
      port <- 54321 + 3 * i  # space ports out; each cluster also claims port + 1 internally
      print(paste0("http://localhost:", port))
      h2o.init(nthreads = 1, max_mem_size = "1G", port = port)
      data(iris)
      data <- as.h2o(iris)
      ss <- h2o.splitFrame(data)
      gbm <- h2o.gbm(x = 1:4, y = "Species", training_frame = ss[[1]])
      # Return the predicted class as a one-column data.frame so that
      # rbind stacks the results into a single data.frame
      data.frame(predict = as.data.frame(h2o.predict(gbm, ss[[2]]))$predict)
    }

    # Release the worker processes when done
    stopCluster(cl)
    

    To judge which option is best, I would try running this on a few data partitions (maybe 10-100) to see which approach scales best. If your training data is small, it's possible that #3 will be faster than #1, but overall, I'd say #1 is probably the most scalable/stable solution.
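
    As a rough way to measure, you can wrap a run in system.time() and compare elapsed times. This minimal sketch times a single GBM; you would time each full approach (the sequential loop from option #1 vs. the foreach version from option #3) the same way:

    library(h2o)
    h2o.init(nthreads = -1)
    data(iris)
    frame <- as.h2o(iris)
    # Elapsed time for one training run; repeat per approach and compare
    t_gbm <- system.time(
      h2o.gbm(x = 1:4, y = "Species", training_frame = frame)
    )
    print(t_gbm)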
