RPi BLCR/MPICH Checkpoint/Restart issue

99封情书 posted on 2020-01-13 13:52:27

Question


After investigating my problem for weeks, I have found some information in the hexdump of the context file (context-num0-0-0, DropBox). I obtained this context without a C/R error, but the restart still does not succeed (links at the end of this question).

<<cut>>
cri_sig_handle.Failed to reregister signal %d in process %d.  
Saw %p when expecting %p (%s) or %p (cri_sig_handler)....cri_run_sig_handler.
Failed to allocate signal %d in process %d: got signal %d instead...sigfillset() 
failed: %s.sigaction() failed: %s..LIBCR_DISABLE_NSCD 
<<cut>>

It seems that checkpointing works.

I have asked some questions about this C/R problem earlier:

Restart a mpi slave after checkpoint before failure on ARMv6
mpiexec checkpointing error (RPi)

I also looked for a solution and followed this one:
MPICH2 Checkpointing Error with BLCR

No luck. I am slowly starting to despair...

I'm sure some HPC gurus are present here. If you could help me, or just explain why it doesn't work (perhaps I am going wrong somewhere), it would be the best Christmas present for me.

[the checkpoint was obtained with this program]
test2.c

mpicc test2.c -o test2 -lcr -lcr_run
mpiexec -n 2 -ckpointlib blcr -ckpoint-prefix /tmp  -ckpoint-interval 5 ./test2

(restart attempts, no success)
mpiexec -n 2 -ckpointlib blcr -ckpoint-prefix /tmp  -ckpoint-interval 5 -ckpoint-num 1-0-0
(or)
mpiexec -n 2 -ckpointlib blcr -ckpoint-prefix /tmp  -ckpoint-interval 5 -ckpoint-num 1
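
Before retrying the restart, it may help to confirm that context files for the requested checkpoint number actually exist under the prefix. The sketch below assumes the files follow the `context-num<N>-...` naming of the file mentioned above (context-num0-0-0). `list_contexts` is a hypothetical helper name, not part of MPICH or BLCR.

```shell
# Hypothetical helper: list checkpoint context files for one checkpoint number.
# Assumes the naming pattern context-num<N>-* inferred from context-num0-0-0.
# Usage: list_contexts <prefix-dir> <checkpoint-number>
list_contexts() {
    prefix=${1%/}   # tolerate a trailing slash, e.g. /tmp/
    num=$2
    found=1         # return status: 1 until at least one file is seen
    for f in "$prefix"/context-num"$num"-*; do
        # If the glob matched nothing, $f is the literal pattern and -f fails.
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
            found=0
        fi
    done
    return "$found"
}
```

For example, `list_contexts /tmp 1` prints nothing and returns non-zero if no checkpoint number 1 was ever written, which would be worth ruling out before passing `-ckpoint-num 1`.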

pi@raspberrypi ~ $ ldd test2
        /usr/lib/arm-linux-gnueabihf/libcofi_rpi.so (0xb6f81000)
        libcr.so.0 => /usr/local/lib/libcr.so.0 (0xb6f71000)
        libcr_run.so.0 => /usr/local/lib/libcr_run.so.0 (0xb6f68000)
        librt.so.1 => /lib/arm-linux-gnueabihf/librt.so.1 (0xb6f4e000)
        libpthread.so.0 => /lib/arm-linux-gnueabihf/libpthread.so.0 (0xb6f2f000)
        libc.so.6 => /lib/arm-linux-gnueabihf/libc.so.6 (0xb6e00000)
        libdl.so.2 => /lib/arm-linux-gnueabihf/libdl.so.2 (0xb6df5000)
        /lib/ld-linux-armhf.so.3 (0xb6f8e000)

I think my problem is somewhere here:

pi@raspberrypi ~ $ ps aux --forest | grep -B 5 defunct
pi        2711  0.0  0.2   2340  1044 ?        Ss   06:17   0:01  |       \_ /usr/lib/openssh/sftp-server
root      4023  0.0  0.7   9804  3204 ?        Ss   08:29   0:00  \_ sshd: pi [priv]  
pi        4030  0.0  0.3   9804  1520 ?        S    08:29   0:01  |   \_ sshd: pi@pts/1   
pi        4031  0.0  0.7   6252  3492 pts/1    Ss   08:29   0:03  |       \_ -bash
pi        4964  0.0  0.2   4452  1152 pts/1    R+   10:01   0:00  |           \_ ps aux --forest
pi        4965  0.0  0.1   3544   812 pts/1    S+   10:01   0:00  |           \_ grep --color=auto -B 5 defunct
root      4693  0.0  0.7   9804  3204 ?        Ss   09:29   0:00  \_ sshd: pi [priv]  
pi        4700  0.0  0.3   9804  1528 ?        S    09:29   0:00      \_ sshd: pi@pts/0   
pi        4701  0.1  0.7   6260  3508 pts/0    Ss   09:29   0:03          \_ -bash
pi        4948  0.0  0.2   3048  1036 pts/0    S+   09:58   0:00              \_ mpiexec -np 2 -launcher fork -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-num 1
pi        4949  0.0  0.1   3024   844 ?        Ss   09:58   0:00                  \_ /bin/hydra_pmi_proxy --control-port raspberrypi:38854 --rmk user --launcher fork --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
pi        4950  0.0  0.0      0     0 ?        Z    09:58   0:00                      \_ [hydra_pmi_proxy] <defunct>
root      2154  0.0  0.8  26524  3736 ?        Sl   05:38   0:01 /usr/sbin/console-kit-daemon --no-daemon
root      2221  0.0  0.6  22288  2920 ?        Sl   05:38   0:00 /usr/lib/policykit-1/polkitd --no-debug
pi        4878  0.0  0.1   3024   844 ?        Ss   09:45   0:00 /bin/hydra_pmi_proxy --control-port raspberrypi:38525 --rmk user --launcher fork --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
pi        4879  0.0  0.0      0     0 ?        Z    09:45   0:00  \_ [test2_] <defunct>
root      4910  0.0  0.1   3024   844 ?        Ss   09:56   0:00 /bin/hydra_pmi_proxy --control-port raspberrypi:49960 --rmk user --launcher fork --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
root      4911  0.0  0.0      0     0 ?        Z    09:56   0:00  \_ [hydra_pmi_proxy] <defunct>
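
To narrow the listing above down to the zombie entries and their parents, a small filter can help. This is only a sketch. It assumes input in the column order produced by `ps -eo pid,ppid,stat,comm`, and `list_zombies` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `ps -eo pid,ppid,stat,comm` output on stdin
# and prints "PID PPID COMMAND" for every process whose state starts with Z.
list_zombies() {
    # NR > 1 skips the header line; column 3 is the STAT field.
    awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}
```

Running `ps -eo pid,ppid,stat,comm | list_zombies` shows which parent has not yet reaped each defunct child. A zombie stays in the process table until its parent calls wait() on it or exits.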


UPDATE: (logs)

host: raspberrypi

==================================================================================================
mpiexec options:
----------------
  Base path: /bin/
  Launcher: (null)
  Debug level: 1
  Enable X: -1

  Global environment:
  -------------------
    MANPATH=:/usr/local/man
    SHELL=/bin/bash
    TERM=linux
    XDG_SESSION_COOKIE=40a78f4b44fccc138ec16d4052434c66-1387121860.786856-628795747
    HUSHLOGIN=FALSE
    USER=pi
    LD_LIBRARY_PATH=:/usr/local/lib:/usr/local/lib64
    LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.axa=00;36:*.oga=00;36:*.spx=00;36:*.xspf=00;36:
    MAIL=/var/mail/pi
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/games:/usr/games:/usr/local/bin
    PWD=/home/pi
    LANG=en_GB.UTF-8
    SHLVL=1
    HOME=/home/pi
    LOGNAME=pi
    _=/bin/mpiexec

  Hydra internal environment:
  ---------------------------
    MPICH_ENABLE_CKPOINT=1
    GFORTRAN_UNBUFFERED_PRECONNECTED=y


    Proxy information:
    *********************
      [1] proxy: raspberrypi (1 cores)
      Exec list: (null) (2 processes); 


==================================================================================================

[mpiexec@raspberrypi] Timeout set to -1 (-1 means infinite)
[mpiexec@raspberrypi] Got a control port string of raspberrypi:51572

Proxy launch args: /bin/hydra_pmi_proxy --control-port raspberrypi:51572 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 

Arguments being passed to proxy 0:
--version 1.5 --iface-ip-env-name MPICH_INTERFACE_HOSTNAME --hostname raspberrypi --global-core-map 0,1,1 --pmi-id-map 0,0 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_6057_0 --pmi-process-mapping (vector,(0,1,1)) --ckpointlib blcr --ckpoint-prefix /tmp/ --global-inherited-env 16 'MANPATH=:/usr/local/man' 'SHELL=/bin/bash' 'TERM=linux' 'XDG_SESSION_COOKIE=40a78f4b44fccc138ec16d4052434c66-1387121860.786856-628795747' 'HUSHLOGIN=FALSE' 'USER=pi' 'LD_LIBRARY_PATH=:/usr/local/lib:/usr/local/lib64' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.axa=00;36:*.oga=00;36:*.spx=00;36:*.xspf=00;36:' 'MAIL=/var/mail/pi' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/games:/usr/games:/usr/local/bin' 'PWD=/home/pi' 'LANG=en_GB.UTF-8' 
'SHLVL=1' 'HOME=/home/pi' 'LOGNAME=pi' '_=/bin/mpiexec' --global-user-env 0 --global-system-env 2 'MPICH_ENABLE_CKPOINT=1' 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 2 --exec-local-env 0 --exec-wdir /home/pi --exec-args 0 

[mpiexec@raspberrypi] Launch arguments: /bin/hydra_pmi_proxy --control-port raspberrypi:51572 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0 

hydra_pmi_proxy and test2 seem to be zombies:

root      2221  0.0  0.6  22288  2920 ?        Sl   05:38   0:01 /usr/lib/policykit-1/polkitd --no-debug
root      5691  0.0  0.3   3796  1704 tty1     Ss   15:36   0:00 /bin/login --   
pi        5707  0.2  0.7   6308  3564 tty1     S    15:37   0:03  \_ -bash
pi        6057  0.1  0.2   3052  1044 tty1     S+   16:01   0:00      \_ mpiexec -n 2 -ckpointlib blcr -ckpoint-prefix /tmp/ -v -ckpoint-num 0
pi        6058  0.0  0.1   3024   844 ?        Ss   16:01   0:00          \_ /bin/hydra_pmi_proxy --control-port raspberrypi:51572 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
pi        6059  0.0  0.0      0     0 ?        Z    16:01   0:00              \_ [hydra_pmi_proxy] <defunct>
pi        5982  0.0  0.1   3024   844 ?        Ss   15:53   0:00 /bin/hydra_pmi_proxy --control-port raspberrypi:58709 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
pi        5983  0.0  0.0      0     0 ?        Z    15:53   0:00  \_ [test2_] <defunct>

Source: https://stackoverflow.com/questions/20794796/rpi-blcr-mpich-checkpoint-restart-issue
