MPI_SEND stops working after MPI_BARRIER


Open MPI has a known feature when it uses TCP/IP for communication: it tries to use all configured network interfaces that are in the "UP" state. This becomes a problem when some of the other nodes are not reachable through all of those interfaces. It is part of a greedy communication optimisation that Open MPI employs, and sometimes, as in your case, it leads to problems.

It seems that at least the second node has more than one interface that is up, and that both addresses were advertised to the first node during the negotiation phase (a quick way to check is shown after this list):

  • one configured with 128.2.100.167
  • one configured with 192.168.109.1 (do you have a tunnel or Xen running on the machine?)
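To verify which interfaces Open MPI will consider, you can list the interfaces that are in the "UP" state on every node and compare the results (a minimal sketch; run it on each node and adapt to your setup):

ip addr show up    # list all interfaces currently in the "UP" state with their addresses

Any interface that shows up here with an IP address, including tunnels and virtualisation bridges, is a candidate for the greedy optimisation described above.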

The barrier communication happens over the first network, and then the next MPI_Send tries to send to the second address over the second network, which obviously does not connect all nodes.

The easiest solution is to tell Open MPI to use only the network that connects your nodes. You can tell it to do so using the following MCA parameter:

--mca btl_tcp_if_include 128.2.100.0/24

(or whatever your communication network is)
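A complete invocation could then look like this (a sketch; the host names, process count, and program name are placeholders to replace with your own):

mpirun --mca btl_tcp_if_include 128.2.100.0/24 -np 2 -host node1,node2 ./your_program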

You can also specify a list of network interface names instead, provided the names are the same on all machines, e.g.

--mca btl_tcp_if_include eth0

or you can tell Open MPI to explicitly exclude certain interfaces (but if you do so, you must always exclude the loopback interface "lo"):

--mca btl_tcp_if_exclude lo,virt0
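If you don't want to repeat the flag on every command line, the same parameter can also be set as an environment variable or in the per-user MCA parameter file, both standard Open MPI mechanisms (the subnet below is the one from this example):

export OMPI_MCA_btl_tcp_if_include=128.2.100.0/24    # picked up by mpirun in this shell

or, persistently, add a line to $HOME/.openmpi/mca-params.conf:

btl_tcp_if_include = 128.2.100.0/24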

Hope that helps you and the many others who appear to have the same problem here on SO. It looks like almost all Linux distros have recently started bringing up various network interfaces by default, and that is likely to cause problems with Open MPI.

P.S. Put those nodes behind a firewall, please!
