Question:
I used MPI_Isend to transfer an array of chars to a slave node. When the size of the array is small it works, but when I enlarge the array it hangs.
Code running on the master node (rank 0):
MPI_Send(&text_length, 1, MPI_INT, dest, MSG_TEXT_LENGTH, MPI_COMM_WORLD);
MPI_Isend(text->chars, 360358, MPI_CHAR, dest, MSG_SEND_STRING, MPI_COMM_WORLD, &request);
MPI_Wait(&request, &status);
Code running on the slave node (rank 1):
MPI_Recv(&count, 1, MPI_INT, 0, MSG_TEXT_LENGTH, MPI_COMM_WORLD, &status);
MPI_Irecv(host_read_string, count, MPI_CHAR, 0, MSG_SEND_STRING, MPI_COMM_WORLD, &request);
MPI_Wait(&request, &status);
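For reference, here is a minimal self-contained sketch of the same pattern, consolidating the two snippets above into one program that can be run with mpiexec -n 2. The surrounding declarations, tag values and buffer allocation are assumptions added for illustration; only the two transfers themselves mirror the original code.

/* Minimal sketch of the question's pattern: rank 0 sends the length first,
 * then the character buffer with a non-blocking send; rank 1 mirrors it.
 * Tag values and buffer setup are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_TEXT_LENGTH 1
#define MSG_SEND_STRING 2

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int text_length = 360358;               /* size of the char array */
        char *chars = malloc(text_length);
        memset(chars, 'a', text_length);

        int dest = 1;
        MPI_Request request;
        MPI_Status status;
        /* announce the length, then ship the buffer itself */
        MPI_Send(&text_length, 1, MPI_INT, dest, MSG_TEXT_LENGTH, MPI_COMM_WORLD);
        MPI_Isend(chars, text_length, MPI_CHAR, dest, MSG_SEND_STRING,
                  MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
        free(chars);
    } else if (rank == 1) {
        int count;
        MPI_Request request;
        MPI_Status status;
        /* learn the length, allocate, then receive the buffer */
        MPI_Recv(&count, 1, MPI_INT, 0, MSG_TEXT_LENGTH, MPI_COMM_WORLD, &status);
        char *host_read_string = malloc(count);
        MPI_Irecv(host_read_string, count, MPI_CHAR, 0, MSG_SEND_STRING,
                  MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
        printf("rank 1 received %d chars\n", count);
        free(host_read_string);
    }

    MPI_Finalize();
    return 0;
}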
You see the count param in MPI_Isend is 360358. It seemed too large for MPI. When I set the param to 1024, it worked well.
Actually this problem has confused me for a few days. I know that there is a limit on the size of data transferred by MPI. But as far as I know, MPI_Send is used to send short messages and MPI_Isend can send larger messages, so I used MPI_Isend.
The network configuration on rank 0 is:
[12t2007@comp01-mpi.gpu01.cis.k.hosei.ac.jp ~]$ ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:1B:21:D9:79:A5
inet addr:192.168.0.101 Bcast:192.168.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:393267 errors:0 dropped:0 overruns:0 frame:0
TX packets:396421 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:35556328 (33.9 MiB) TX bytes:79580008 (75.8 MiB)
eth0.2002 Link encap:Ethernet HWaddr 00:1B:21:D9:79:A5
inet addr:10.111.2.36 Bcast:10.111.2.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:133577 errors:0 dropped:0 overruns:0 frame:0
TX packets:127677 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:14182652 (13.5 MiB) TX bytes:17504189 (16.6 MiB)
eth1 Link encap:Ethernet HWaddr 00:1B:21:D9:79:A4
inet addr:192.168.1.101 Bcast:192.168.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:206981 errors:0 dropped:0 overruns:0 frame:0
TX packets:303185 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:168952610 (161.1 MiB) TX bytes:271792020 (259.2 MiB)
eth2 Link encap:Ethernet HWaddr 00:25:90:91:6B:56
inet addr:10.111.1.36 Bcast:10.111.1.255 Mask:255.255.254.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:26459977 errors:0 dropped:0 overruns:0 frame:0
TX packets:15700862 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:12533940345 (11.6 GiB) TX bytes:2078001873 (1.9 GiB)
Memory:fb120000-fb140000
eth3 Link encap:Ethernet HWaddr 00:25:90:91:6B:57
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Memory:fb100000-fb120000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:1894012 errors:0 dropped:0 overruns:0 frame:0
TX packets:1894012 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:154962344 (147.7 MiB) TX bytes:154962344 (147.7 MiB)
The network configuration on rank 1 is:
[12t2007@comp02-mpi.gpu01.cis.k.hosei.ac.jp ~]$ ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:1B:21:D9:79:5F
inet addr:192.168.0.102 Bcast:192.168.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:328449 errors:0 dropped:0 overruns:0 frame:0
TX packets:278631 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:47679329 (45.4 MiB) TX bytes:39326294 (37.5 MiB)
eth0.2002 Link encap:Ethernet HWaddr 00:1B:21:D9:79:5F
inet addr:10.111.2.37 Bcast:10.111.2.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:94126 errors:0 dropped:0 overruns:0 frame:0
TX packets:53782 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:8313498 (7.9 MiB) TX bytes:6929260 (6.6 MiB)
eth1 Link encap:Ethernet HWaddr 00:1B:21:D9:79:5E
inet addr:192.168.1.102 Bcast:192.168.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:121527 errors:0 dropped:0 overruns:0 frame:0
TX packets:41865 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:158117588 (150.7 MiB) TX bytes:5084830 (4.8 MiB)
eth2 Link encap:Ethernet HWaddr 00:25:90:91:6B:50
inet addr:10.111.1.37 Bcast:10.111.1.255 Mask:255.255.254.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:26337628 errors:0 dropped:0 overruns:0 frame:0
TX packets:15500750 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:12526923258 (11.6 GiB) TX bytes:2032767897 (1.8 GiB)
Memory:fb120000-fb140000
eth3 Link encap:Ethernet HWaddr 00:25:90:91:6B:51
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Memory:fb100000-fb120000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:1895944 errors:0 dropped:0 overruns:0 frame:0
TX packets:1895944 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:154969511 (147.7 MiB) TX bytes:154969511 (147.7 MiB)
Answer 1:
The peculiarities of using TCP/IP with Open MPI are described in the FAQ. I'll try to give an executive summary here.
Open MPI uses a greedy approach when it comes to utilising network interfaces for data exchange. In particular, the tcp BTL (Byte Transfer Layer) and tcp OOB (Out-Of-Band) components will try to use all configured network interfaces with matching address families. In your case each node has many interfaces with addresses from the IPv4 address family:
comp01-mpi comp02-mpi
----------------------------------------------------------
eth0 192.168.0.101/24 eth0 192.168.0.102/24
eth0.2002 10.111.2.36/24 eth0.2002 10.111.2.37/24
eth1 192.168.1.101/24 eth1 192.168.1.102/24
eth2 10.111.1.36/23 eth2 10.111.1.37/23
lo 127.0.0.1/8 lo 127.0.0.1/8
Open MPI assumes that each interface on comp02-mpi is reachable from any interface on comp01-mpi and vice versa. This is never the case with the loopback interface lo, therefore by default Open MPI excludes lo. Network sockets are then opened lazily (i.e. on demand) when information has to be transported.
What happens in your case is that when transporting messages, Open MPI chops them into fragments and then tries to send the different fragments over different connections in order to maximise the bandwidth. By default the fragments are 128 KiB in size, which holds only 32768 int elements; the very first (eager) fragment is 64 KiB and holds half as many elements. It might happen that the assumption that each interface on comp01-mpi is reachable from each interface on comp02-mpi (and vice versa) is wrong, e.g. if some of them are connected to separate isolated networks. In that case the library will be stuck trying to make a connection that can never succeed, and the program will hang. This should usually happen for messages of more than 16384 int elements.
To prevent the above mentioned situation, one can restrict the interfaces or networks that Open MPI uses for TCP/IP communication. The btl_tcp_if_include MCA parameter can be used to provide the library with the list of interfaces that it should use. The btl_tcp_if_exclude parameter can be used to instruct the library which interfaces to exclude. The latter is set to lo by default, and if one would like to exclude specific interface(s), then one should explicitly add lo to the list.
Everything from above also applies to the out-of-band communication used to transport special information. The parameters for selecting or deselecting interfaces for OOB are oob_tcp_if_include and oob_tcp_if_exclude respectively. Those are usually set together with the BTL parameters. Therefore you should try setting them to combinations that actually work. Start by narrowing the selection down to a single interface:
mpiexec --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 ...
If it doesn't work with eth0, try other interfaces.
The presence of the virtual interface eth0.2002 is going to further confuse Open MPI 1.6.2 and newer.
Answer 2:
I think that instead of 360358 you should use text_length when sending the message (in the MPI_Isend call) from rank 0.
I believe the reason you send the length first is to eliminate the need to hard-code the number, so why hard-code the number of elements when sending to the other node? That could be the reason for the inconsistency at the receiving end.
The signature of MPI_Isend is:
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
where count is the number of elements in the send buffer. It might lead to a segmentation fault if count is larger than the array length, or to fewer elements being sent across if count is smaller.
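In other words, a sketch of the suggested fix on the sender side, reusing the text_length variable that rank 0 already sends in the preceding MPI_Send:

/* count now matches the length announced in the preceding MPI_Send */
MPI_Isend(text->chars, text_length, MPI_CHAR, dest, MSG_SEND_STRING, MPI_COMM_WORLD, &request);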
Source: https://stackoverflow.com/questions/22475816/unable-to-run-mpi-when-transfering-large-data