I have a central server where a script, started periodically from cron, checks remote servers. The checks run serially: first one server, then the next, and so on.
This script on the central server starts another script (let's call it update.sh) on the remote machine, and that remote script does something like this:
    processID=`pgrep "processName"`
    kill $processID
    startProcess.sh
The process is killed, and then startProcess.sh starts it again like this:
    pidof "processName"
    if [ ! $? -eq 0 ]; then
        nohup "processName" "processArgs" >> "processLog" &
        pidof "processName"
        if [ ! $? -eq 0 ]; then
            echo "Error: failed to start process"
            ...
update.sh, startProcess.sh and the actual binary of the process are all on an NFS share mounted from the central server.
What sometimes happens is that the process I try to start in startProcess.sh is not started, and I get the error. The strange part is that it is random: sometimes the process starts on a machine, and another time it does not start on that same machine. I'm checking about 300 servers, and the errors are always random.
There is another thing: the remote servers are in 3 different geographic locations (2 in America and 1 in Europe), and the central server is in Europe. What I have found so far is that the servers in America have many more errors than those in Europe.
At first I thought the error had something to do with kill, so I added a sleep between the kill and startProcess.sh, but that didn't make any difference.
It also seems that the process from startProcess.sh is not started at all, or something happens to it right as it starts, because there is no output in the logfile, and there should be.
So I'm asking for help here.
Has anybody had this kind of problem, or does anybody know what might be wrong?
Thanks for any help
(Sorry, but my original answer was fairly wrong... Here is the correction)
Using $? to get the exit status of the background process in startProcess.sh gives the wrong result. The bash man page states:
    Special Parameters
        ?    Expands to the status of the most recently executed foreground pipeline.
As you mentioned in your comment, the proper way to get a background process's exit status is the wait builtin. But for this, bash has to process the SIGCHLD signal.
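To make the difference between $? and wait concrete, here is a minimal sketch. The subshell that exits with status 3 is only a hypothetical stand-in for the real process, and the variable names are made up for this illustration:

```shell
#!/bin/bash
# Hypothetical stand-in for the real process: a subshell that fails with status 3.
( sleep 1; exit 3 ) &

# Right after starting a background job, $? only says that the job was
# launched; it does not carry the job's own exit status.
launch_status=$?
bg_pid=$!

# wait blocks until that child has ended and returns the child's exit status.
wait "$bg_pid"
bg_status=$?

echo "launch status: $launch_status, background exit status: $bg_status"
# prints: launch status: 0, background exit status: 3
```

Note that wait only helps if it is acceptable to block until the child has ended, which is why the test environment below combines it with a SIGCHLD trap.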
I made a small test environment for this to show how it can work:
Here is a script, loop.sh, to run as a background process:
    #!/bin/bash
    # Simulate a failing process when called with -x
    [ "$1" == -x ] && exit 1
    cnt=${1:-500}
    while ((++c <= cnt)); do
        echo "SLEEPING [$$]: $c/$cnt"
        sleep 5
    done
If the argument is -x, it exits with status 1 to simulate an error. If the argument is a number num, it runs for about num*5 seconds, printing SLEEPING [<PID>]: <counter>/<max_counter> to stdout every 5 seconds.
The second is the launcher script. It starts 3 loop.sh scripts in the background and prints their exit status:
    #!/bin/bash

    # Report the exit status of every child that is no longer running
    handle_chld() {
        local i
        for i in "${!pids[@]}"; do
            if [ ! -d "/proc/${pids[i]}" ]; then
                wait "${pids[i]}"
                echo "Stopped ${pids[i]}; exit code: $?"
                unset 'pids[i]'
            fi
        done
    }

    set -o monitor
    trap "handle_chld" CHLD

    # Start background processes
    ./loop.sh 3 &
    pids+=($!)
    ./loop.sh 2 &
    pids+=($!)
    ./loop.sh -x &
    pids+=($!)

    # Wait until all background processes are stopped
    while [ ${#pids[@]} -gt 0 ]; do
        echo "WAITING FOR: ${pids[@]}"
        sleep 2
    done
    echo STOPPED
The handle_chld function handles the SIGCHLD signals. Setting the monitor option enables a non-interactive script to receive SIGCHLD. Then the trap is set for the SIGCHLD signal.
Then the background processes are started, and their PIDs are stored in the pids array. When SIGCHLD is received, the handler checks the /proc/ directories to find which child has stopped (the one whose directory is missing); this could also be checked with the kill -0 <PID> bash builtin. After wait, the exit status of the background process is available in the famous $? pseudo-variable.
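The kill -0 alternative to the /proc/ check could look roughly like this. It is only a sketch: is_running is a hypothetical helper name, and sleep 1 stands in for a real child process:

```shell
#!/bin/bash
# is_running is a hypothetical helper, not part of the launcher script above.
is_running() {
    # kill -0 sends no signal; it only reports whether the PID can be signalled.
    kill -0 "$1" 2>/dev/null
}

sleep 1 &
child=$!

if is_running "$child"; then before=running; else before=gone; fi

# Once the child has ended and been waited for, its PID no longer exists.
wait "$child"
if is_running "$child"; then after=running; else after=gone; fi

echo "before wait: $before, after wait: $after"
```

Unlike the /proc/ test, this does not depend on a Linux-style /proc filesystem. Both checks can be fooled if the PID has been reused by an unrelated process in the meantime.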
The main script waits for all PIDs to stop (otherwise it could not get the exit status of its children), and then it exits.
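If the launcher has nothing else to do while its children run, a simpler variant without the trap may be enough: call wait on each remembered PID in turn. This is a sketch of that alternative, not the launcher above; the subshells are hypothetical stand-ins for ./loop.sh with known exit statuses:

```shell
#!/bin/bash
# Hypothetical stand-ins for ./loop.sh: subshells with known exit statuses.
pids=()
( sleep 1; exit 0 ) &
pids+=($!)
( sleep 2; exit 0 ) &
pids+=($!)
( exit 1 ) &
pids+=($!)

statuses=()
for pid in "${pids[@]}"; do
    # wait returns the child's exit status, even if the child finished earlier.
    wait "$pid"
    statuses+=($?)
done

echo "exit codes: ${statuses[*]}"
# prints: exit codes: 0 0 1
```

The trade-off is that the statuses are reported in start order rather than as each child stops, so the SIGCHLD approach is still the better fit when the script needs to react immediately.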
An example output:
    WAITING FOR: 13102 13103 13104
    SLEEPING [13103]: 1/2
    SLEEPING [13102]: 1/3
    Stopped 13104; exit code: 1
    WAITING FOR: 13102 13103
    WAITING FOR: 13102 13103
    SLEEPING [13103]: 2/2
    SLEEPING [13102]: 2/3
    WAITING FOR: 13102 13103
    WAITING FOR: 13102 13103
    SLEEPING [13102]: 3/3
    Stopped 13103; exit code: 0
    WAITING FOR: 13102
    WAITING FOR: 13102
    WAITING FOR: 13102
    Stopped 13102; exit code: 0
    STOPPED
It can be seen that the exit codes are reported correctly.
I hope this can help a bit!