I've looked through many examples and I'm still confused. I've compiled a simple latency check program from here, and it runs perfectly on one host, but when I try to run
There are two kinds of communication involved in running an Open MPI job. First, the job has to be launched. Open MPI uses a special framework to support many kinds of launch mechanisms, and you are probably using the rsh remote login launch mechanism over SSH. Your firewall is evidently set up to allow SSH connections, since the job does get launched.
Once an Open MPI job is launched and the processes are true MPI programs, they connect back to the mpirun process that spawned the job and learn all about the other processes in the job, most importantly the network endpoints available at each process. The processes then use those endpoints to connect to one another directly, which is the second kind of communication. This message:
[4][[5989,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.0.2.5 failed: Connection timed out (110)
indicates that the process running on host 4 is unable to open a TCP connection to the process running on host 5. The most common reason for that is a firewall that limits inbound connections, so checking your firewall is the first thing to do.
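One way to test the firewall hypothesis outside of MPI is to attempt a plain TCP connection from one node to the other. The following is only a minimal sketch: probe_tcp is a hypothetical helper, not part of Open MPI, and the port in the usage comment is an arbitrary assumption, since Open MPI picks its TCP ports dynamically unless told otherwise.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a single TCP connection attempt to host:port.

    Returns:
      "open"    - the connection was established
      "refused" - the attempt was actively rejected
      "timeout" - no answer arrived within `timeout` seconds
      "error"   - any other failure (e.g. no route, name lookup failed)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except OSError:
        return "error"


# Hypothetical usage from the first node (port 5000 is an assumption):
# print(probe_tcp("10.0.2.5", 5000))
```

A "timeout" result corresponds to the "Connection timed out (110)" in the error message and usually means the packets are being silently dropped, which is typical of a firewall. A "refused" result usually means the other host was reached and simply has nothing listening on that port.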
Another common reason is that both nodes have additional network interfaces configured and up, with compatible network addresses, but with no way to establish a connection between them. This often happens on newer Linux setups, where various virtual and/or tunnelling interfaces are brought up by default. One can instruct Open MPI to skip those interfaces by listing them (either as interface names or as CIDR network addresses) in the btl_tcp_if_exclude MCA parameter, e.g.:
$ mpirun --mca btl_tcp_if_exclude "127.0.0.1/8,tun0" ...
(one always has to add the loopback interface when setting btl_tcp_if_exclude, because the value given replaces the default one, which already excludes loopback)
Alternatively, one can explicitly specify which interfaces to use for communication by listing them in the btl_tcp_if_include MCA parameter:
$ mpirun --mca btl_tcp_if_include eth0 ...
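If mpirun is started from a script, the rule about always keeping the loopback entry in the exclude list is easy to forget. The following is a rough sketch of a helper that builds the option for you. The name tcp_if_args and the ./latency binary are hypothetical, and the check that only one of the two lists is given reflects my understanding that the two parameters are mutually exclusive.

```python
from typing import Iterable, List

# Same loopback entry as in the btl_tcp_if_exclude example above.
LOOPBACK = "127.0.0.1/8"


def tcp_if_args(include: Iterable[str] = (),
                exclude: Iterable[str] = ()) -> List[str]:
    """Build the mpirun arguments that select TCP interfaces.

    Entries may be interface names (e.g. "eth0") or CIDR networks.
    When an exclude list is given, the loopback entry is added to it
    if it is missing.
    """
    include = list(include)
    exclude = list(exclude)
    if include and exclude:
        # Assumption: Open MPI does not accept both parameters at once.
        raise ValueError("set either include or exclude, not both")
    if include:
        return ["--mca", "btl_tcp_if_include", ",".join(include)]
    if exclude:
        if LOOPBACK not in exclude:
            exclude.insert(0, LOOPBACK)
        return ["--mca", "btl_tcp_if_exclude", ",".join(exclude)]
    return []


# Hypothetical usage; "./latency" stands in for your own MPI binary:
# cmd = ["mpirun", *tcp_if_args(exclude=["tun0"]), "-np", "2", "./latency"]
```

The resulting list can be passed to subprocess.run as is, which avoids quoting problems with the comma-separated value.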
Since the IP address in the error message matches the address of your second host in the hostfile, Open MPI is already using the right interface, so the problem must come from an active firewall rule.