Starting a process from bash script failed

Anonymous (unverified), posted on 2019-12-03 00:56:02

Question:

I have a central server where cron periodically starts a script that checks remote servers. The checks run serially, one server after another.

This script on the central server starts another script (let's call it update.sh) on the remote machine, and that remote script does something like this:

    processID=`pgrep "processName"`
    kill $processID
    startProcess.sh

The process is killed and then started again in startProcess.sh like this:

    pidof "processName"
    if [ ! $? -eq 0 ]; then
        nohup "processName" "processArgs" >> "processLog" &
        pidof "processName"
        if [ ! $? -eq 0 ]; then
            echo "Error: failed to start process"
    ...

update.sh, startProcess.sh, and the binary of the process they start are all on an NFS share mounted from the central server.

What happens sometimes is that the process I try to start from startProcess.sh does not start, and I get the error. The strange part is that it is random: sometimes the process on a given machine starts, and another time on that same machine it doesn't. I'm checking about 300 servers, and the errors always appear at random.

One more thing: the remote servers are in 3 different geographic locations (2 in America and 1 in Europe), and the central server is in Europe. From what I have found so far, the servers in America produce many more errors than those in Europe.

At first I thought the error had something to do with kill, so I added a sleep between the kill and startProcess.sh, but that didn't make any difference.

It also seems that the process from startProcess.sh is not started at all, or that something happens to it right as it starts, because nothing is written to the log file, and there should be output there.

So, here I'm asking for help

Has anybody had this kind of problem, or does anyone know what might be wrong?

Thanks for any help

Answer 1:

(Sorry, but my original answer was fairly wrong... Here is the correction)

Using $? to get the exit status of the background process in startProcess.sh gives the wrong result. The bash man page states:

    Special Parameters
    ?      Expands to the status of the most recently executed foreground
           pipeline.

As you mentioned in your comment, the proper way to get a background process's exit status is the wait builtin. To learn about a child's exit without blocking on it, the script also has to handle the SIGCHLD signal.
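As a minimal sketch of the difference between $? and wait (failing_job is a hypothetical stand-in, not part of the original scripts): right after a command is put in the background, $? only reports that the job was launched, while wait on the PID saved from $! returns the job's own exit status. This version simply blocks in wait, so it needs no signal handling.

```shell
#!/bin/bash
# Hypothetical stand-in for a process that fails shortly after starting.
failing_job() { sleep 1; return 3; }

failing_job &
launch_status=$?   # status of launching the job, not of the job itself
job_pid=$!         # PID of the most recent background job

wait "$job_pid"    # block until that job exits
job_status=$?      # the job's real exit status

echo "status right after '&': $launch_status"
echo "status from wait:       $job_status"
```

Run with bash, this should report 0 for the first status and 3 for the second.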

I made a small test environment for this to show how it can work:

Here is a script loop.sh to run as a background process:

    #!/bin/bash
    [ "$1" == -x ] && exit 1;
    cnt=${1:-500}
    while ((++c<=cnt)); do echo "SLEEPING [$$]: $c/$cnt"; sleep 5; done

If the argument is -x, the script exits with status 1 to simulate an error. If the argument is a number num, it runs for num*5 seconds, printing SLEEPING [<PID>]: <counter>/<max_counter> to stdout every 5 seconds.

The second is the launcher script. It starts 3 loop.sh scripts in the background and prints their exit status:

    #!/bin/bash

    # Runs on every SIGCHLD: find children that have exited and report their status.
    handle_chld() {
        local i
        for i in "${!pids[@]}"; do
            if [ ! -d "/proc/${pids[i]}" ]; then
                wait "${pids[i]}"
                echo "Stopped ${pids[i]}; exit code: $?"
                unset 'pids[i]'
            fi
        done
    }

    set -o monitor
    trap "handle_chld" CHLD

    # Start background processes
    ./loop.sh 3 & pids+=($!)
    ./loop.sh 2 & pids+=($!)
    ./loop.sh -x & pids+=($!)

    # Wait until all background processes are stopped
    while [ ${#pids[@]} -gt 0 ]; do echo "WAITING FOR: ${pids[@]}"; sleep 2; done
    echo STOPPED

The handle_chld function handles the SIGCHLD signals. Setting the monitor option lets a non-interactive script receive SIGCHLD. Then the trap is set for the SIGCHLD signal.

Then the background processes are started, and their PIDs are stored in the pids array. When SIGCHLD is received, the handler checks the /proc/ directories to find which child has stopped (the one whose directory is missing); this could also be checked with the kill -0 <PID> builtin. After wait returns, the exit status of the background process is available in the $? special parameter.
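The kill -0 alternative mentioned above could look like the following sketch. is_running is a hypothetical helper name; kill -0 sends no signal and only reports whether the PID exists and may be signalled. The sketch assumes the PID has not been reused by an unrelated process in the meantime.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if a process with the given PID exists
# and the current user is allowed to signal it. No signal is delivered.
is_running() {
    kill -0 "$1" 2>/dev/null
}

sleep 2 &
pid=$!

# The child has just been started, so it should still be alive here.
if is_running "$pid"; then before=running; else before=gone; fi

# Once wait has collected the child, its PID no longer exists.
wait "$pid"
if is_running "$pid"; then after=running; else after=gone; fi

echo "before wait: $before, after wait: $after"
```

Unlike the /proc test, this check does not depend on a Linux-style /proc filesystem being mounted.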

The main script waits for all PIDs to stop (otherwise it could not get the exit status of its children) and then it exits.

An example output:

    WAITING FOR: 13102 13103 13104
    SLEEPING [13103]: 1/2
    SLEEPING [13102]: 1/3
    Stopped 13104; exit code: 1
    WAITING FOR: 13102 13103
    WAITING FOR: 13102 13103
    SLEEPING [13103]: 2/2
    SLEEPING [13102]: 2/3
    WAITING FOR: 13102 13103
    WAITING FOR: 13102 13103
    SLEEPING [13102]: 3/3
    Stopped 13103; exit code: 0
    WAITING FOR: 13102
    WAITING FOR: 13102
    WAITING FOR: 13102
    Stopped 13102; exit code: 0
    STOPPED

It can be seen that the exit codes are reported correctly.

I hope this can help a bit!


