Starting h2o in hadoop cluster with specific connection node url


Question


Is there a way to start an h2o instance interface on a specific node of a cluster? For example...

When using the command:

$ hadoop jar h2odriver.jar -nodes 4 -mapperXmx 6g -output hdfsOutputDir

from, say, the h2o install directory on node 172.18.4.62, I get the (abridged) output:

....
H2O node 172.18.4.65:54321 reports H2O cluster size 1
H2O node 172.18.4.66:54321 reports H2O cluster size 1
H2O node 172.18.4.67:54321 reports H2O cluster size 1
H2O node 172.18.4.63:54321 reports H2O cluster size 1
H2O node 172.18.4.63:54321 reports H2O cluster size 4
H2O node 172.18.4.66:54321 reports H2O cluster size 4
H2O node 172.18.4.67:54321 reports H2O cluster size 4
H2O node 172.18.4.65:54321 reports H2O cluster size 4
H2O cluster (4 nodes) is up
(Note: Use the -disown option to exit the driver after cluster formation)

Open H2O Flow in your web browser: http://172.18.4.65:54321

(Press Ctrl-C to kill the cluster)
Blocking until the H2O cluster shuts down...

And from a python script that wants to connect to the h2o instance, I would do something like:

h2o.init(ip="172.18.4.65")

to connect to the h2o instance. However, it would be better to be able to control which node's address the h2o instance is reachable at.

Is there a way to do this? Is this question confused/wrong-headed? My overall goal is to have the python script run periodically: start an h2o cluster, do stuff on that cluster, then shut the cluster down. Not knowing in advance which address the cluster will come up on means the script can never be sure which address to connect to. Any advice would be appreciated. Thanks.


Answer 1:


When you start an H2O cluster on Hadoop as below:

$ hadoop jar h2odriver.jar -nodes 3 -mapperXmx 10g -output /user/test

You will get output like the following just after the command is executed:

Determining driver host interface for mapper->driver callback...
    [Possible callback IP address: x.x.x.217]
    [Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: x.x.x.217:39562

(You can override these with -driverif and -driverport/-driverportrange.)

As you can see, the callback IP address is selected by the Hadoop runtime. In most cases the IP address and port are chosen by the Hadoop runtime to find the best available interface.

You also have the option of passing -driverif x.x.x.x -driverport NNNNN along with the hadoop command; however, I am not sure this is really a good option. I have only tested it with the IP of the node from which I launch the cluster, but it does work with the IP where the command is launched.
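For illustration only (the interface address and port below are placeholders, and I have not verified this exact invocation), the override would look something like:

$ hadoop jar h2odriver.jar -nodes 3 -mapperXmx 10g -output /user/test -driverif x.x.x.217 -driverport 39562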

Based on my experience, the most common way to start an H2O cluster on Hadoop is to let Hadoop decide where the cluster runs; the client just needs to parse the output line below:

Open H2O Flow in your web browser: x.x.x.x:54321

Parse the above line to get the IP address/port of the cluster to connect to from the R/Python API.
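As a rough sketch of that workflow from Python (my own illustration, not from the original answer: it assumes the driver is started with -disown so it exits once the cluster is up, and that the "Open H2O Flow ..." line appears in the captured output; the command arguments mirror the question's example):

import re
import subprocess

import h2o

# Launch the driver; -disown makes it exit once the cluster has formed.
driver_cmd = [
    "hadoop", "jar", "h2odriver.jar",
    "-nodes", "4", "-mapperXmx", "6g",
    "-output", "hdfsOutputDir",
    "-disown",
]
result = subprocess.run(driver_cmd, capture_output=True, text=True, check=True)

# The driver prints a line such as:
#   Open H2O Flow in your web browser: http://172.18.4.65:54321
# Search both stdout and stderr, since Hadoop mixes its logging streams.
match = re.search(
    r"Open H2O Flow in your web browser:\s*(?:https?://)?([\d.]+):(\d+)",
    result.stdout + result.stderr,
)
if match is None:
    raise RuntimeError("Could not find the H2O Flow address in the driver output")
ip, port = match.group(1), int(match.group(2))

h2o.init(ip=ip, port=port)   # connect to whichever node Hadoop picked

# ... do stuff on the cluster ...

h2o.cluster().shutdown()     # tear the cluster down when finished

One practical note for a periodic job: the HDFS -output directory generally must not already exist, so the script would need to vary or remove it between runs.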



Source: https://stackoverflow.com/questions/47722047/starting-h2o-in-hadoop-cluster-with-specific-connection-node-url
