Question
After investigating my problem for weeks, I have found some information in the hexdump of a checkpoint context file (context-num0-0-0, on Dropbox). I obtained this one without a C/R error (links at the end of this question), but the restart still does not succeed:
<<cut>>
cri_sig_handle.Failed to reregister signal %d in process %d.
Saw %p when expecting %p (%s) or %p (cri_sig_handler)....cri_run_sig_handler.
Failed to allocate signal %d in process %d: got signal %d instead...sigfillset()
failed: %s.sigaction() failed: %s..LIBCR_DISABLE_NSCD
<<cut>>
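An excerpt like the one above can be pulled out of a binary context file without reading a full hexdump. The following is a minimal sketch and not part of the original setup: `ckpt_strings` is a hypothetical helper name, and the default minimum length of 8 is an arbitrary choice. It turns every non-printable byte into a newline and keeps only the longer printable runs.

```shell
# ckpt_strings FILE [MINLEN]
# Print runs of printable ASCII characters that are at least MINLEN
# bytes long (default 8). Helper name and default are illustrative only.
ckpt_strings() {
    file=$1
    minlen=${2:-8}
    # In the C locale, every byte outside printable ASCII becomes a
    # newline, so each printable run ends up on its own line.
    LC_ALL=C tr -c '[:print:]' '\n' < "$file" |
        awk -v n="$minlen" 'length($0) >= n'
}
```

For example, `ckpt_strings /tmp/context-num0-0-0 | grep cri_` would show only the libcr messages, assuming the context file lives under /tmp as in the commands used in this question.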
It seems that checkpointing works.
I have asked some questions about this C/R problem earlier:
Restart a mpi slave after checkpoint before failure on ARMv6
mpiexec checkpointing error (RPi)
I also looked for a solution and followed this one:
MPICH2 Checkpointing Error with BLCR
No luck. I am slowly starting to despair...
I'm sure some HPC gurus are present here. If you could help me, or just explain why it doesn't work (maybe I'm going wrong somewhere), it would be the best Christmas present for me.
[The checkpoint was obtained with this program:]
test2.c
mpicc test2.c -o test2 -lcr -lcr_run
mpiexec -n 2 -ckpointlib blcr -ckpoint-prefix /tmp -ckpoint-interval 5 ./test2
(restarting does not work with either of these:)
mpiexec -n 2 -ckpointlib blcr -ckpoint-prefix /tmp -ckpoint-interval 5 -ckpoint-num 1-0-0
(or)
mpiexec -n 2 -ckpointlib blcr -ckpoint-prefix /tmp -ckpoint-interval 5 -ckpoint-num 1
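Before picking a value for `-ckpoint-num`, it can help to see which checkpoint numbers actually exist under the prefix directory. The sketch below assumes the files are named `context-num<N>-<x>-<y>`, which is inferred only from the single file name `context-num0-0-0` mentioned in this question; `list_ckpoint_nums` is a hypothetical helper, not an MPICH or BLCR tool.

```shell
# list_ckpoint_nums [DIR]
# Print the distinct checkpoint numbers N found in DIR (default /tmp),
# assuming files are named context-num<N>-<x>-<y> (an assumption based
# on the one file name seen in this question).
list_ckpoint_nums() {
    dir=${1:-/tmp}
    for f in "$dir"/context-num*-*-*; do
        [ -e "$f" ] || continue        # the glob matched nothing
        name=${f##*/}                  # strip the directory part
        rest=${name#context-num}       # e.g. "0-0-0"
        printf '%s\n' "${rest%%-*}"    # keep only the leading number
    done | sort -n -u
}
```

Running `list_ckpoint_nums /tmp` prints one number per line. An empty result would mean no file matching that pattern exists under the prefix, so a restart from that directory has nothing to read.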
pi@raspberrypi ~ $ ldd test2
/usr/lib/arm-linux-gnueabihf/libcofi_rpi.so (0xb6f81000)
libcr.so.0 => /usr/local/lib/libcr.so.0 (0xb6f71000)
libcr_run.so.0 => /usr/local/lib/libcr_run.so.0 (0xb6f68000)
librt.so.1 => /lib/arm-linux-gnueabihf/librt.so.1 (0xb6f4e000)
libpthread.so.0 => /lib/arm-linux-gnueabihf/libpthread.so.0 (0xb6f2f000)
libc.so.6 => /lib/arm-linux-gnueabihf/libc.so.6 (0xb6e00000)
libdl.so.2 => /lib/arm-linux-gnueabihf/libdl.so.2 (0xb6df5000)
/lib/ld-linux-armhf.so.3 (0xb6f8e000)
I think my problem is somewhere here:
pi@raspberrypi ~ $ ps aux --forest | grep -B 5 defunct
pi 2711 0.0 0.2 2340 1044 ? Ss 06:17 0:01 | \_ /usr/lib/openssh/sftp-server
root 4023 0.0 0.7 9804 3204 ? Ss 08:29 0:00 \_ sshd: pi [priv]
pi 4030 0.0 0.3 9804 1520 ? S 08:29 0:01 | \_ sshd: pi@pts/1
pi 4031 0.0 0.7 6252 3492 pts/1 Ss 08:29 0:03 | \_ -bash
pi 4964 0.0 0.2 4452 1152 pts/1 R+ 10:01 0:00 | \_ ps aux --forest
pi 4965 0.0 0.1 3544 812 pts/1 S+ 10:01 0:00 | \_ grep --color=auto -B 5 defunct
root 4693 0.0 0.7 9804 3204 ? Ss 09:29 0:00 \_ sshd: pi [priv]
pi 4700 0.0 0.3 9804 1528 ? S 09:29 0:00 \_ sshd: pi@pts/0
pi 4701 0.1 0.7 6260 3508 pts/0 Ss 09:29 0:03 \_ -bash
pi 4948 0.0 0.2 3048 1036 pts/0 S+ 09:58 0:00 \_ mpiexec -np 2 -launcher fork -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-num 1
pi 4949 0.0 0.1 3024 844 ? Ss 09:58 0:00 \_ /bin/hydra_pmi_proxy --control-port raspberrypi:38854 --rmk user --launcher fork --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
pi 4950 0.0 0.0 0 0 ? Z 09:58 0:00 \_ [hydra_pmi_proxy] <defunct>
root 2154 0.0 0.8 26524 3736 ? Sl 05:38 0:01 /usr/sbin/console-kit-daemon --no-daemon
root 2221 0.0 0.6 22288 2920 ? Sl 05:38 0:00 /usr/lib/policykit-1/polkitd --no-debug
pi 4878 0.0 0.1 3024 844 ? Ss 09:45 0:00 /bin/hydra_pmi_proxy --control-port raspberrypi:38525 --rmk user --launcher fork --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
pi 4879 0.0 0.0 0 0 ? Z 09:45 0:00 \_ [test2_] <defunct>
root 4910 0.0 0.1 3024 844 ? Ss 09:56 0:00 /bin/hydra_pmi_proxy --control-port raspberrypi:49960 --rmk user --launcher fork --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
root 4911 0.0 0.0 0 0 ? Z 09:56 0:00 \_ [hydra_pmi_proxy] <defunct>
UPDATE (logs):
host: raspberrypi
==================================================================================================
mpiexec options:
----------------
Base path: /bin/
Launcher: (null)
Debug level: 1
Enable X: -1
Global environment:
-------------------
MANPATH=:/usr/local/man
SHELL=/bin/bash
TERM=linux
XDG_SESSION_COOKIE=40a78f4b44fccc138ec16d4052434c66-1387121860.786856-628795747
HUSHLOGIN=FALSE
USER=pi
LD_LIBRARY_PATH=:/usr/local/lib:/usr/local/lib64
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.axa=00;36:*.oga=00;36:*.spx=00;36:*.xspf=00;36:
MAIL=/var/mail/pi
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/games:/usr/games:/usr/local/bin
PWD=/home/pi
LANG=en_GB.UTF-8
SHLVL=1
HOME=/home/pi
LOGNAME=pi
_=/bin/mpiexec
Hydra internal environment:
---------------------------
MPICH_ENABLE_CKPOINT=1
GFORTRAN_UNBUFFERED_PRECONNECTED=y
Proxy information:
*********************
[1] proxy: raspberrypi (1 cores)
Exec list: (null) (2 processes);
==================================================================================================
[mpiexec@raspberrypi] Timeout set to -1 (-1 means infinite)
[mpiexec@raspberrypi] Got a control port string of raspberrypi:51572
Proxy launch args: /bin/hydra_pmi_proxy --control-port raspberrypi:51572 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id
Arguments being passed to proxy 0:
--version 1.5 --iface-ip-env-name MPICH_INTERFACE_HOSTNAME --hostname raspberrypi --global-core-map 0,1,1 --pmi-id-map 0,0 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_6057_0 --pmi-process-mapping (vector,(0,1,1)) --ckpointlib blcr --ckpoint-prefix /tmp/ --global-inherited-env 16 'MANPATH=:/usr/local/man' 'SHELL=/bin/bash' 'TERM=linux' 'XDG_SESSION_COOKIE=40a78f4b44fccc138ec16d4052434c66-1387121860.786856-628795747' 'HUSHLOGIN=FALSE' 'USER=pi' 'LD_LIBRARY_PATH=:/usr/local/lib:/usr/local/lib64' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.axa=00;36:*.oga=00;36:*.spx=00;36:*.xspf=00;36:' 'MAIL=/var/mail/pi' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/games:/usr/games:/usr/local/bin' 'PWD=/home/pi' 'LANG=en_GB.UTF-8' 
'SHLVL=1' 'HOME=/home/pi' 'LOGNAME=pi' '_=/bin/mpiexec' --global-user-env 0 --global-system-env 2 'MPICH_ENABLE_CKPOINT=1' 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 2 --exec-local-env 0 --exec-wdir /home/pi --exec-args 0
[mpiexec@raspberrypi] Launch arguments: /bin/hydra_pmi_proxy --control-port raspberrypi:51572 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
hydra_pmi_proxy and test2 seem to be zombies:
root 2221 0.0 0.6 22288 2920 ? Sl 05:38 0:01 /usr/lib/policykit-1/polkitd --no-debug
root 5691 0.0 0.3 3796 1704 tty1 Ss 15:36 0:00 /bin/login --
pi 5707 0.2 0.7 6308 3564 tty1 S 15:37 0:03 \_ -bash
pi 6057 0.1 0.2 3052 1044 tty1 S+ 16:01 0:00 \_ mpiexec -n 2 -ckpointlib blcr -ckpoint-prefix /tmp/ -v -ckpoint-num 0
pi 6058 0.0 0.1 3024 844 ? Ss 16:01 0:00 \_ /bin/hydra_pmi_proxy --control-port raspberrypi:51572 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
pi 6059 0.0 0.0 0 0 ? Z 16:01 0:00 \_ [hydra_pmi_proxy] <defunct>
pi 5982 0.0 0.1 3024 844 ? Ss 15:53 0:00 /bin/hydra_pmi_proxy --control-port raspberrypi:58709 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
pi 5983 0.0 0.0 0 0 ? Z 15:53 0:00 \_ [test2_] <defunct>
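A `<defunct>` entry is a zombie: the process has already exited, but its parent has not yet collected the exit status. To see only the zombies together with their parents, the `ps` output can be filtered as in the sketch below. `list_zombies` is a hypothetical helper name; it expects the column order produced by `ps -eo pid,ppid,stat,comm`.

```shell
# list_zombies
# Read `ps -eo pid,ppid,stat,comm` output on stdin and print
# "PID PPID COMMAND" for every process whose state starts with Z.
list_zombies() {
    # NR > 1 skips the header line; column 3 is the process state.
    awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}
```

Used as `ps -eo pid,ppid,stat,comm | list_zombies`, this gives the PPID of each zombie, which is the process that would have to reap it. In the listings above that parent is a `hydra_pmi_proxy` in every case.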
Source: https://stackoverflow.com/questions/20794796/rpi-blcr-mpich-checkpoint-restart-issue