Question
I am trying to run a simple MPI program (multiple array addition). It runs perfectly on my PC but simply hangs or shows the following error on the cluster. I am using Open MPI and the command given below to execute it.
Network Config of the cluster (master & node1)
MASTER
eth0      Link encap:Ethernet  HWaddr 00:22:19:A4:52:74
          inet addr:10.1.1.1  Bcast:10.1.255.255  Mask:255.255.0.0
          inet6 addr: fe80::222:19ff:fea4:5274/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16914 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7183 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2050581 (1.9 MiB)  TX bytes:981632 (958.6 KiB)

eth1      Link encap:Ethernet  HWaddr 00:22:19:A4:52:76
          inet addr:192.168.41.203  Bcast:192.168.41.255  Mask:255.255.255.0
          inet6 addr: fe80::222:19ff:fea4:5276/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:701 errors:0 dropped:0 overruns:0 frame:0
          TX packets:228 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:75457 (73.6 KiB)  TX bytes:25295 (24.7 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:88362 errors:0 dropped:0 overruns:0 frame:0
          TX packets:88362 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:21529504 (20.5 MiB)  TX bytes:21529504 (20.5 MiB)

peth0     Link encap:Ethernet  HWaddr 00:22:19:A4:52:74
          inet6 addr: fe80::222:19ff:fea4:5274/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:17175 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7257 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2373869 (2.2 MiB)  TX bytes:1020320 (996.4 KiB)
          Interrupt:16 Memory:da000000-da012800

peth1     Link encap:Ethernet  HWaddr 00:22:19:A4:52:76
          inet6 addr: fe80::222:19ff:fea4:5276/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1112 errors:0 dropped:0 overruns:0 frame:0
          TX packets:302 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:168837 (164.8 KiB)  TX bytes:33241 (32.4 KiB)
          Interrupt:16 Memory:d6000000-d6012800

virbr0    Link encap:Ethernet  HWaddr 52:54:00:E3:80:BC
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
NODE 1
eth0      Link encap:Ethernet  HWaddr 00:22:19:53:42:C6
          inet addr:10.1.255.253  Bcast:10.1.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16559 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7299 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1898811 (1.8 MiB)  TX bytes:1056294 (1.0 MiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:25 errors:0 dropped:0 overruns:0 frame:0
          TX packets:25 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3114 (3.0 KiB)  TX bytes:3114 (3.0 KiB)

peth0     Link encap:Ethernet  HWaddr 00:22:19:53:42:C6
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16913 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7276 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2221627 (2.1 MiB)  TX bytes:1076708 (1.0 MiB)
          Interrupt:16 Memory:f8000000-f8012800

virbr0    Link encap:Ethernet  HWaddr 52:54:00:E7:E5:FF
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
Error
Command used:
mpirun -machinefile machine -np 4 ./query

Error output:
[[22877,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.122.1 failed: Connection refused (111)
Code
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define group MPI_COMM_WORLD
#define root  0
#define size  100

int main(int argc, char *argv[])
{
    int no_tasks, task_id, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(group, &no_tasks);
    MPI_Comm_rank(group, &task_id);

    int arr1[size], arr2[size], local1[size], local2[size];

    /* Only the root rank fills the full input arrays. */
    if (task_id == root)
    {
        for (i = 0; i < size; i++)
        {
            arr1[i] = arr2[i] = i;
        }
    }

    /* Distribute equal chunks of both arrays to all ranks. */
    MPI_Scatter(arr1, size/no_tasks, MPI_INT, local1, size/no_tasks, MPI_INT, root, group);
    MPI_Scatter(arr2, size/no_tasks, MPI_INT, local2, size/no_tasks, MPI_INT, root, group);

    /* Each rank adds its local chunks element-wise. */
    for (i = 0; i < size/no_tasks; i++)
    {
        local1[i] += local2[i];
    }

    /* Collect the partial sums back on the root rank. */
    MPI_Gather(local1, size/no_tasks, MPI_INT, arr1, size/no_tasks, MPI_INT, root, group);

    if (task_id == root)
    {
        printf("The Array Sum Is\n");
        for (i = 0; i < size; i++)
        {
            printf("%d ", arr1[i]);
        }
    }

    MPI_Finalize();
    return 0;
}
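For completeness, the program is normally built with Open MPI's wrapper compiler before being launched with mpirun; a minimal sketch, assuming the source file is named query.c (the file name is not given in the question):

$ mpicc -o query query.c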
Answer 1:
Tell Open MPI not to use the virtual bridge interface virbr0 for sending messages over TCP/IP, or better, tell it to use only eth0 for that purpose:

$ mpiexec --mca btl_tcp_if_include eth0 ...
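Alternatively, the offending interfaces can be excluded instead of whitelisting eth0; a sketch, assuming the bridge is named virbr0 on every node. Note that btl_tcp_if_include and btl_tcp_if_exclude are mutually exclusive, so set only one of them, and an exclude list should also contain the loopback interface:

$ mpiexec --mca btl_tcp_if_exclude lo,virbr0 -machinefile machine -np 4 ./query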
This comes from the greedy behaviour of Open MPI's tcp BTL component, which transmits messages over TCP/IP. It tries to use all available network interfaces that are up on each node in order to maximise the data bandwidth. Both nodes have virbr0 configured with the same subnet address. Open MPI fails to recognise that both addresses are identical, but since the subnets match, it assumes that it should be able to talk over virbr0. So process A tries to send a message to process B, which resides on the other node. Process B listens on port P, and process A knows this, so it tries to connect to 192.168.122.1:P. But that address actually belongs to the virbr0 interface on the node where process A runs, so the node ends up trying to talk to itself on a non-existent port, hence the "connection refused" error.
Source: https://stackoverflow.com/questions/15227933/cluster-hangs-shows-error-while-executing-simple-mpi-program-in-c