mpiexec checkpointing error (RPi)

Anonymous (unverified), submitted 2019-12-03 02:33:02


When I try to run an application (even a simple hello_world.c fails), I receive this error every time:

mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 10 -machinefile /tmp/machinefile -n 1 ./app_name
[proxy:0:0@masterpi] requesting checkpoint
[proxy:0:0@masterpi] checkpoint completed
[proxy:0:0@masterpi] requesting checkpoint
[proxy:0:0@masterpi] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.
[proxy:0:0@masterpi] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
[proxy:0:0@masterpi] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@masterpi] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@masterpi] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@masterpi] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@masterpi] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@masterpi] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion

I just want to take a checkpoint (and restart from it later), nothing else.

Thanks in advance


I have also tried MPICH2, with no luck. Or maybe I'm doing something wrong somewhere...

pi@raspberrypi ~ $ mpiexec -n 1 -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 2 ./test3
Count to: 0
[proxy:0:0@raspberrypi] requesting checkpoint
[proxy:0:0@raspberrypi] checkpoint completed
Count to: 1
[proxy:0:0@raspberrypi] requesting checkpoint
[proxy:0:0@raspberrypi] HYDT_ckpoint_checkpoint (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.
[proxy:0:0@raspberrypi] HYD_pmcd_pmip_control_cmd_cb (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip_cb.c:902): checkpoint suspend failed
[proxy:0:0@raspberrypi] HYDT_dmxu_poll_wait_for_event (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@raspberrypi] main (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip.c:210): demux engine error waiting for event
[mpiexec@raspberrypi] control_cb (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:201): assert (!closed) failed
[mpiexec@raspberrypi] HYDT_dmxu_poll_wait_for_event (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@raspberrypi] HYD_pmci_wait_for_completion (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec@raspberrypi] main (/tmp/mpich/mpich2-1.5/src/pm/hydra/ui/mpich/mpiexec.c:325): process manager error waiting for completion


#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int rank;
    int size;
    int i = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Status status;

    if (rank == 0) {
        for (i = 0; i <= 100; i++) {
            /* Busy loop to slow the program down between prints
               (an optimizing compiler may remove it). */
            int j = 0;
            while (j < 100000000) {
                j++;
            }
            printf("Count to: %i\n", i);
        }
    } else {
        /* Other ranks do nothing. */
    }

    MPI_Finalize();
    return 0;
}

I just need one successful checkpoint and to demonstrate the restart. If someone has a working example (it doesn't matter what it does; a simple working "Hello World" would make me happy!), I would be very glad.

Happy new year!


Unfortunately, the checkpoint/restart code in MPICH 3.0.4 is known to be buggy at the moment. That will hopefully get fixed in a future release. It looks like you're probably using it correctly. If you go back to a previous version, you might have better luck.


The problem here was that the checkpoint interval was too small. Setting it to 20 s or more solved this problem (but not the other one :( ).
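
One way to picture why a short interval can trigger "Previous checkpoint has not completed" is the toy model below. It is only a sketch under assumptions and not Hydra's actual logic: the function name first_overlapping_request and the per-checkpoint durations are hypothetical, and request k is simply assumed to be issued at k times the interval.

```c
#include <stddef.h>

/*
 * Toy model only (not Hydra's real implementation).
 * Assumption: checkpoint request k is issued at time k * interval_s and
 * needs durations_s[k - 1] seconds to be written out.
 * Returns the 1-based number of the first request that arrives while the
 * previous checkpoint is still in progress, or 0 if no request overlaps.
 */
int first_overlapping_request(double interval_s, const double *durations_s, size_t n)
{
    for (size_t k = 1; k < n; k++) {
        double prev_start = (double)k * interval_s;        /* request k issued */
        double prev_end = prev_start + durations_s[k - 1];  /* request k done */
        double next_request = (double)(k + 1) * interval_s; /* request k + 1 */
        if (prev_end > next_request) {
            return (int)(k + 1);
        }
    }
    return 0;
}
```

With made-up durations of 3 s per checkpoint, an interval of 2 s returns 2 (the second request overlaps, which matches the position of the failure in the log above), while an interval of 20 s returns 0. The real time a checkpoint takes on a Raspberry Pi would have to be measured.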
