I am currently running H2O's DRF algorithm on a 3-node EC2 cluster (the H2O server spans all 3 nodes). My data set has 1M rows and 41 columns (40 predictors and 1 response). Training is noticeably slower than when I run the same job on a single node. Why would that be?
If your Random Forest is slower on a multi-node H2O cluster, it just means that your dataset is not big enough to take advantage of distributed computing. There is an overhead to communicate between cluster nodes, so if you can train your model successfully on a single node, then using a single node will always be faster.
Multi-node is designed for when your data is too big to train on a single node. Only then will it be worth using multiple nodes. Otherwise, you are just adding communication overhead for no reason and will see the type of slowdown that you observed.
If your data fits into memory on a single machine (and you can successfully train a model w/o running out of memory), the way to speed up your training is to switch to a machine with more cores. You can also play around with certain parameter values which affect training speed to see if you can get a speed-up, but that usually comes at a cost in model performance.
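For example, here is a minimal R sketch of both ideas: start a single-node H2O instance that uses all available cores, then try DRF parameters that mainly affect training speed. The file path, column name, memory size, and parameter values below are placeholders, not taken from the question, so adjust them to your own setup.

```r
library(h2o)

# Start (or connect to) a single-node H2O instance using all available cores;
# "16G" is just a placeholder for however much memory the machine can spare.
h2o.init(nthreads = -1, max_mem_size = "16G")

# Hypothetical data location and column roles.
train <- h2o.importFile("s3://my-bucket/train.csv")
y <- "response"
x <- setdiff(names(train), y)

# Parameters that mainly trade accuracy for speed:
# fewer/shallower trees, coarser split histograms, smaller samples per tree.
drf <- h2o.randomForest(x = x, y = y, training_frame = train,
                        ntrees = 50,                    # fewer trees train faster
                        max_depth = 15,                 # shallower than the default 20
                        nbins = 16,                     # coarser histograms than the default 20
                        sample_rate = 0.5,              # row sample per tree
                        col_sample_rate_per_tree = 0.8, # column sample per tree
                        seed = 1)
```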
As Erin says, often adding more nodes just adds the capability for bigger data sets, not quicker learning. Random forest might be the worst; I get fairly good results with deep learning (e.g. 3x quicker with 4 nodes, 5-6x quicker with 8 nodes).
In your comment on Erin's answer you mention that the real problem is that you want to speed up hyper-parameter optimization. It is frustrating that h2o.grid() doesn't support building models in parallel, one on each node, when the data will fit in memory on each node. But you can do that yourself with a bit of scripting: set up one H2O cluster on each node, do a grid search with a subset of hyper-parameters on each node, have them save the results and models to S3, then bring the results in and combine them at the end. (If doing a random grid search, you can run exactly the same grid on each cluster, but it might be a good idea to explicitly use a different seed on each.)
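A rough R sketch of that idea, under some assumptions: a separate single-node H2O instance is already running on each machine, and the data path, bucket names, column name, and hyper-parameter values are all hypothetical. Each node runs the same random grid but with a different seed, then saves its models and results so you can combine them on one machine afterwards.

```r
library(h2o)

# Run this script on each node, changing only node_id.
node_id <- 1
h2o.init(ip = "localhost", port = 54321)

train <- h2o.importFile("s3://my-bucket/train.csv")   # hypothetical path
y <- "response"
x <- setdiff(names(train), y)

# Same hyper-parameter space on every node...
hyper_params <- list(ntrees      = c(50, 100, 200),
                     max_depth   = c(10, 20, 30),
                     sample_rate = c(0.5, 0.7, 0.9))

# ...but a different random-search seed per node, so each node explores
# a different subset of the grid.
search_criteria <- list(strategy = "RandomDiscrete",
                        max_models = 20,
                        seed = 1000 + node_id)

grid <- h2o.grid(algorithm = "randomForest",
                 grid_id = paste0("drf_grid_node", node_id),
                 x = x, y = y, training_frame = train,
                 hyper_params = hyper_params,
                 search_criteria = search_criteria)

# Save the models and a summary of the results somewhere shared (e.g. S3),
# then pull everything together and compare at the end.
results <- h2o.getGrid(grid@grid_id, sort_by = "rmse", decreasing = FALSE)
for (mid in unlist(results@model_ids)) {
  h2o.saveModel(h2o.getModel(mid), path = "s3://my-bucket/grid-models/", force = TRUE)
}
write.csv(as.data.frame(results@summary_table),
          paste0("grid_results_node", node_id, ".csv"), row.names = FALSE)
```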