mpiexec checkpointing error (RPi)

Anonymous (unverified), submitted 2019-12-03 02:33:02

Question:

Whenever I try to run an application with checkpointing enabled (even a simple hello_world.c fails), I get this error:

mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 10 -machinefile /tmp/machinefile -n 1 ./app_name

[proxy:0:0@masterpi] requesting checkpoint
[proxy:0:0@masterpi] checkpoint completed
[proxy:0:0@masterpi] requesting checkpoint
[proxy:0:0@masterpi] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.
[proxy:0:0@masterpi] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
[proxy:0:0@masterpi] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@masterpi] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@masterpi] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@masterpi] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@masterpi] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@masterpi] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion

I just want to take a checkpoint (and restart from it later), nothing else.

Thanks in advance

UPDATE:

I have also tried with MPICH2, with no luck. Or maybe I'm doing something wrong somewhere...

pi@raspberrypi ~ $ mpiexec -n 1 -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 2 ./test3
Count to: 0
[proxy:0:0@raspberrypi] requesting checkpoint
[proxy:0:0@raspberrypi] checkpoint completed
Count to: 1
[proxy:0:0@raspberrypi] requesting checkpoint
[proxy:0:0@raspberrypi] HYDT_ckpoint_checkpoint (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.
[proxy:0:0@raspberrypi] HYD_pmcd_pmip_control_cmd_cb (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip_cb.c:902): checkpoint suspend failed
[proxy:0:0@raspberrypi] HYDT_dmxu_poll_wait_for_event (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@raspberrypi] main (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip.c:210): demux engine error waiting for event
[mpiexec@raspberrypi] control_cb (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:201): assert (!closed) failed
[mpiexec@raspberrypi] HYDT_dmxu_poll_wait_for_event (/tmp/mpich/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@raspberrypi] HYD_pmci_wait_for_completion (/tmp/mpich/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec@raspberrypi] main (/tmp/mpich/mpich2-1.5/src/pm/hydra/ui/mpich/mpiexec.c:325): process manager error waiting for completion

Test3-Code:

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int rank;
    int size;
    int i = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Status status;

    if (rank == 0) {
        /* Rank 0 counts from 0 to 100, printing once per iteration. */
        for (i = 0; i <= 100; i++) {
            int j = 0;
            /* Busy loop to burn time between prints. */
            while (j < 100000000) {
                j++;
            }
            printf("Count to: %i\n", i);
        }
    } else {
        /* All other ranks do nothing. */
    }

    MPI_Finalize();
    return 0;
}

I just need one successful checkpoint and to be able to show the restart. If someone has a working example (it doesn't matter what it does; a simple working "Hello World" would make me happy!), I would be very glad.

Happy new year!

Answer 1:

Unfortunately, the checkpoint/restart code in MPICH 3.0.4 is known to be buggy at the moment. That will hopefully get fixed in a future release. It looks like you're probably using it correctly. It's possible that if you go back to a previous version, you might have better luck.



Answer 2:

The problem here was that the checkpoint interval was too small. Setting it to 20 s or more solved this problem (but not the other one :( ).


