Question
Why do Merge() and Disconnect() freeze when I try to use mpi4py on CentOS 7? I'm using Python 2.7.5, mpi4py 2.0.0, and I had to load the openmpi/gnu/1.8.8 module.
I had trouble doing this under CentOS 6, and the only version of MPI that worked for me was openmpi/gnu/1.6.5. Unfortunately, I don't see that version in the yum repositories for CentOS 7.
Is there a way to trace what's happening in mpi4py or MPI? Is there a way to get the older version of MPI on CentOS 7?
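In case it matters which library mpi4py is actually picking up, here's a quick check I can run to see what it was built against and is loading at runtime (a minimal sketch; I believe mpi4py.get_config() and MPI.Get_library_version() are the right calls in mpi4py 2.0.0, but I'm not certain):
# check_mpi.py -- print which MPI library mpi4py is linked against
import mpi4py
from mpi4py import MPI

print('mpi4py version: {}'.format(mpi4py.__version__))
print('build config:   {}'.format(mpi4py.get_config()))        # compiler/library info recorded at build time
print('MPI standard:   {}.{}'.format(*MPI.Get_version()))      # (major, minor) of the MPI standard
print('MPI library:    {}'.format(MPI.Get_library_version()))  # e.g. "Open MPI v1.8.8, ..."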
Here's the code I'm trying to run:
# mpi_spawn_test.py
import sys
from time import sleep
from mpi4py import MPI

WORKER_COMMAND = 'worker'
SHOULD_MERGE = False
SHOULD_DISCONNECT = False

def main():
    command = len(sys.argv) > 1 and sys.argv[1] or '1'
    if command != WORKER_COMMAND:
        worker_count = int(command)
        print('launching {} workers.'.format(worker_count))
        comm = MPI.COMM_SELF.Spawn(sys.executable,
                                   args=[sys.argv[0], WORKER_COMMAND],
                                   maxprocs=worker_count)
        print('launched workers.')
        if SHOULD_MERGE:
            comm = comm.Merge()
            print("Merged workers.")
        for i in range(worker_count):
            msg = comm.recv(source=MPI.ANY_SOURCE)
            print("Manager received {}.".format(msg))
        print("Manager finished with fleet size {}.".format(comm.Get_size()))
    else:
        print('worker launched.')
        comm = MPI.Comm.Get_parent()
        print("Got parent.")
        if SHOULD_MERGE:
            comm = comm.Merge()
            print("Merged parent.")
        size = comm.Get_size()
        rank = comm.Get_rank()
        comm.send(rank, dest=0)
        print("Finished worker: rank {} of {}".format(rank, size))
    if SHOULD_DISCONNECT:
        comm.Disconnect()
    print("Finished with command {}.".format(command))

main()
I launch that with this command:
mpiexec -n 1 python mpi_spawn_test.py 3
Then I see this output:
launching 3 workers.
launched workers.
worker launched.
Got parent.
Finished worker: rank 1 of 3
Manager received 1.
worker launched.
Got parent.
worker launched.
Got parent.
Finished worker: rank 2 of 3
Manager received 0.
Finished worker: rank 0 of 3
Manager received 2.
Manager finished with fleet size 1.
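(I assume the "fleet size 1" line shows 1 because comm on the manager side is still the intercommunicator returned by Spawn(), and Get_size() reports only the local group; if I wanted to count the workers' side I would presumably ask the intercommunicator for its remote group size instead:)
# on the manager, after Spawn() and without Merge() -- names match the script above
local_size = comm.Get_size()          # size of the local group: just the manager, so 1
remote_size = comm.Get_remote_size()  # size of the remote group: the 3 spawned workers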
If I set SHOULD_DISCONNECT to True, I see one or two "Finished with command worker." messages, then the process freezes.
If I set SHOULD_MERGE to True, I see the "launched workers" and "Got parent" messages, then the process freezes.
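For reference, the merge pattern I was trying to follow is roughly this (a sketch of my understanding: the high argument to Merge() decides which group is ordered first in the merged intracommunicator, so the manager would pass False and the workers True to keep the manager at rank 0; in the script above I just used the default on both sides):
# manager side, after Spawn():
comm = comm.Merge(False)   # manager's group ordered first, so the manager stays rank 0

# worker side, after Get_parent():
comm = comm.Merge(True)    # workers' group ordered after the manager's group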
I got some hints from the MPI debugging page, but I don't really understand the debug output. As an example, here's a launch I tried:
mpiexec -mca btl_base_verbose 1 -mca state_base_verbose 1 -n 1 python mpi_spawn_test.py 3
Here's the verbose output:
[octomore:136217] [[12091,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_rsh_module.c:940
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE INIT_COMPLETE AT base/plm_base_launch_support.c:335
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE PENDING ALLOCATION AT base/plm_base_launch_support.c:346
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:437
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:202
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE ALL DAEMONS REPORTED AT plm_rsh_module.c:1053
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE VM READY AT base/plm_base_launch_support.c:190
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE PENDING MAPPING AT base/plm_base_launch_support.c:227
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE MAP COMPLETE AT base/rmaps_base_map_job.c:535
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE PENDING FINAL SYSTEM PREP AT base/plm_base_launch_support.c:253
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE PENDING APP LAUNCH AT base/plm_base_launch_support.c:476
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,1],0] STATE RUNNING AT base/odls_base_default_fns.c:1565
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE LOCAL LAUNCH COMPLETE AT base/odls_base_default_fns.c:1613
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE RUNNING AT base/state_base_fns.c:487
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,1],0] STATE SYNC REGISTERED AT base/odls_base_default_fns.c:1856
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE SYNC REGISTERED AT base/state_base_fns.c:495
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE READY FOR DEBUGGERS AT base/plm_base_launch_support.c:696
[octomore:136219] mca: bml: Using self btl to [[12091,1],0] on node octomore
launching 3 workers.
[octomore:136217] [[12091,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_rsh_module.c:940
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE INIT_COMPLETE AT base/plm_base_launch_support.c:335
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE PENDING ALLOCATION AT base/plm_base_launch_support.c:346
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:437
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:202
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE ALL DAEMONS REPORTED AT plm_rsh_module.c:1053
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE VM READY AT base/plm_base_launch_support.c:190
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE PENDING MAPPING AT base/plm_base_launch_support.c:227
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE MAP COMPLETE AT base/rmaps_base_map_job.c:535
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE PENDING FINAL SYSTEM PREP AT base/plm_base_launch_support.c:253
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE PENDING APP LAUNCH AT base/plm_base_launch_support.c:476
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],0] STATE RUNNING AT base/odls_base_default_fns.c:1565
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],1] STATE RUNNING AT base/odls_base_default_fns.c:1565
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],2] STATE RUNNING AT base/odls_base_default_fns.c:1565
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE LOCAL LAUNCH COMPLETE AT base/odls_base_default_fns.c:1613
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE RUNNING AT base/state_base_fns.c:487
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],0] STATE SYNC REGISTERED AT base/odls_base_default_fns.c:1856
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],1] STATE SYNC REGISTERED AT base/odls_base_default_fns.c:1856
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],2] STATE SYNC REGISTERED AT base/odls_base_default_fns.c:1856
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE SYNC REGISTERED AT base/state_base_fns.c:495
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE READY FOR DEBUGGERS AT base/plm_base_launch_support.c:696
[octomore:136221] mca: bml: Using self btl to [[12091,2],0] on node octomore
[octomore:136222] mca: bml: Using self btl to [[12091,2],1] on node octomore
[octomore:136223] mca: bml: Using self btl to [[12091,2],2] on node octomore
[octomore:136221] mca: bml: Using vader btl to [[12091,2],1] on node octomore
[octomore:136221] mca: bml: Using vader btl to [[12091,2],2] on node octomore
[octomore:136223] mca: bml: Using vader btl to [[12091,2],0] on node octomore
[octomore:136223] mca: bml: Using vader btl to [[12091,2],1] on node octomore
[octomore:136222] mca: bml: Using vader btl to [[12091,2],0] on node octomore
[octomore:136222] mca: bml: Using vader btl to [[12091,2],2] on node octomore
[octomore:136221] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136223] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136222] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136221] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136223] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136223] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136221] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136222] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136222] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],0] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],1] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],2] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],0] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],1] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],2] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],0] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],1] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],2] on node octomore
launched workers.
worker launched.
Got parent.
worker launched.
Got parent.
worker launched.
Got parent.
^C[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],2] STATE IOF COMPLETE AT iof_hnp_read.c:275
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],0] STATE IOF COMPLETE AT iof_hnp_read.c:275
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],1] STATE IOF COMPLETE AT iof_hnp_read.c:275
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,1],0] STATE IOF COMPLETE AT iof_hnp_read.c:275
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],2] STATE COMMUNICATION FAILURE AT oob_tcp_component.c:941
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],0] STATE COMMUNICATION FAILURE AT oob_tcp_component.c:941
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],1] STATE COMMUNICATION FAILURE AT oob_tcp_component.c:941
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,1],0] STATE COMMUNICATION FAILURE AT oob_tcp_component.c:941
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],2] STATE NORMALLY TERMINATED AT base/state_base_fns.c:510
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],0] STATE NORMALLY TERMINATED AT base/state_base_fns.c:510
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],1] STATE NORMALLY TERMINATED AT base/state_base_fns.c:510
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,1],0] STATE NORMALLY TERMINATED AT base/state_base_fns.c:510
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE NORMALLY TERMINATED AT base/state_base_fns.c:535
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE NORMALLY TERMINATED AT base/state_base_fns.c:535
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE NOTIFY COMPLETED AT base/state_base_fns.c:724
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE NOTIFY COMPLETED AT base/state_base_fns.c:724
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE NORMALLY TERMINATED AT base/state_base_fns.c:443
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE NORMALLY TERMINATED AT base/state_base_fns.c:443
[octomore:136217] [[12091,0],0] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT orted/orted_comm.c:446
[octomore:136217] [[12091,0],0] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT orted/orted_comm.c:446
Source: https://stackoverflow.com/questions/42446934/mpi4py-freezes-when-calling-merge-and-disconnect