Running MPI on two hosts

感动是毒 2021-01-01 07:41

I've looked through many examples and I'm still confused. I've compiled a simple latency check program from here, and it runs perfectly on one host, but when I try to run

1 Answer
  • 2021-01-01 08:34

    There are two kinds of communication involved in running an Open MPI job. First, the job has to be launched. Open MPI uses a special framework to support many kinds of launches, and you are probably using the rsh remote login launch mechanism over SSH. Since the launch itself works, your firewall is evidently set up to allow SSH connections.

    Second, once an Open MPI job is launched and the processes are true MPI programs, they connect back to the mpirun process that spawned the job and learn about the other processes in the job, most importantly the network endpoints available at each process. This message:

    [4][[5989,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.0.2.5 failed: Connection timed out (110)
    

    indicates that the process running on host 4 is unable to open a TCP connection to the process running on host 5. The most common reason is a firewall that restricts inbound connections, so checking your firewall is the first thing to do.
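
    One way to check plain TCP reachability between the two hosts, independently of MPI, is a small probe like the sketch below. It is only an illustration: the function name can_connect, the 3-second timeout, and the port used in the usage note are assumptions, not anything Open MPI provides.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (for example packets silently
        # dropped by a firewall), and unreachable hosts or networks.
        return False
```

    To use it, start a listener on the second host on some port of your choosing (5000 here is an arbitrary, hypothetical choice) and call can_connect("10.0.2.5", 5000) from the first host. A result of False while SSH to the same address works would be consistent with a firewall that only admits selected ports. As far as I know, the TCP BTL normally picks its ports at run time, so probing a single fixed port only tells you about that port.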

    Another common reason is that both nodes have additional network interfaces configured and up, with compatible network addresses, but with no actual connectivity between them. This often happens on newer Linux setups, where various virtual and/or tunnelling interfaces are brought up by default. You can instruct Open MPI to skip those interfaces by listing them (either as interface names or as CIDR network addresses) in the btl_tcp_if_exclude MCA parameter, e.g.:

    $ mpirun --mca btl_tcp_if_exclude "127.0.0.1/8,tun0" ...
    

    (one always has to add the loopback interface when setting btl_tcp_if_exclude, because the value you set replaces the default, which already excludes loopback)

    or one can explicitly specify which interfaces are to be used for communication by listing them in the btl_tcp_if_include MCA parameter:

    $ mpirun --mca btl_tcp_if_include eth0 ...
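
    To decide which interfaces to exclude or include, it can help to compare the addresses configured on both hosts and see which ones fall into the same IP network. The sketch below does that with Python's ipaddress module. The interface names and addresses are made up for illustration (only 10.0.2.5 comes from the error message), and the helper shared_subnets is hypothetical; it does not reproduce how Open MPI itself matches interfaces.

```python
import ipaddress


def shared_subnets(local_ifaces, remote_ifaces):
    """Return (local_name, remote_name, network) for interface pairs on the same IP network."""
    matches = []
    for lname, lcidr in local_ifaces.items():
        lnet = ipaddress.ip_interface(lcidr).network
        for rname, rcidr in remote_ifaces.items():
            rnet = ipaddress.ip_interface(rcidr).network
            if lnet == rnet:
                matches.append((lname, rname, str(lnet)))
    return matches


# Hypothetical interface tables, e.g. copied by hand from `ip addr` on each host.
host_a = {"eth0": "10.0.2.4/24", "virbr0": "192.168.122.1/24"}
host_b = {"eth0": "10.0.2.5/24", "virbr0": "192.168.122.1/24"}

for local_name, remote_name, network in shared_subnets(host_a, host_b):
    print(f"{local_name} <-> {remote_name} share {network}")
```

    With this made-up data, both the eth0 pair and the virbr0 pair are reported as sharing a network. The addresses on the virtual bridge look compatible even though the two bridges are not connected to each other, which is the kind of interface one would list in btl_tcp_if_exclude or leave out of btl_tcp_if_include.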
    

    Since the IP address in the error message matches the address of your second host in the hostfile, the problem must come from an active firewall rule.
