High availability computing: How to deal with a non-returning system call, without risking false positives?

野的像风 2021-01-27 00:18

I have a process that's running on a Linux computer as part of a high-availability system. The process has a main thread that receives requests from the other computers on the

3 answers
  •  暖寄归人
    2021-01-27 00:42

    I think you need a shared activity marker.

    Have the main thread (or, in a more general application, every worker thread) update the shared activity marker with the current time (or clock tick, e.g. the current nanosecond count computed from clock_gettime(CLOCK_MONOTONIC, ...)). Have the heartbeat thread periodically check when the marker was last updated, and cancel itself (thus stopping the heartbeat broadcast) if there has been no update within a reasonable time.
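
A minimal sketch of the timestamp and staleness check described above. The function names (monotonic_ns, marker_is_stale) and the timeout parameter are illustrative assumptions, not part of any existing API; only clock_gettime(CLOCK_MONOTONIC, ...) comes from the answer itself. How the marker is shared between threads is left out here.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: monotonic time in nanoseconds, or -1 on failure.
 * CLOCK_MONOTONIC is not affected by wall-clock adjustments. */
int64_t monotonic_ns(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + (int64_t)ts.tv_nsec;
}

/* Hypothetical helper: true when the marker was last refreshed more than
 * timeout_ns ago, i.e. the heartbeat thread should stop broadcasting. */
bool marker_is_stale(int64_t last_activity_ns, int64_t now_ns, int64_t timeout_ns)
{
    return now_ns - last_activity_ns > timeout_ns;
}
```

The worker would store monotonic_ns() into the marker each time it makes progress, and the heartbeat thread would call marker_is_stale() with its own reading of the clock before each broadcast.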

    This scheme can easily be extended with a state flag if the workload is very sporadic. The main work thread sets the flag and updates the activity marker when it begins a unit of work, and clears the flag when the work has completed. If there is no work being done then the heartbeat is sent without checking the activity marker. If work is being done then the heartbeat is stopped if the time since the activity marker was updated exceeds the maximum processing time allowed for a unit of work. (Multiple worker threads each need their own activity marker and flag in this case, and the heartbeat thread can be designed to stop when any one worker thread gets stuck, or only when all worker threads get stuck, depending on their purposes and importance to the overall system).
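
The flag-based decision could be sketched roughly as follows, assuming one status record per worker. The struct, the enum, and the function names are hypothetical; the two policies correspond to the "any one worker" and "all workers" choices mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-worker record. */
struct worker_status {
    bool    busy;              /* set when a unit of work begins, cleared when it completes */
    int64_t last_activity_ns;  /* monotonic timestamp of the latest marker update */
};

enum stop_policy { STOP_IF_ANY_STUCK, STOP_IF_ALL_STUCK };

/* An idle worker is never considered stuck; a busy one is stuck once it
 * has exceeded the maximum processing time for a unit of work. */
bool worker_is_stuck(const struct worker_status *w, int64_t now_ns, int64_t max_work_ns)
{
    return w->busy && (now_ns - w->last_activity_ns > max_work_ns);
}

/* Returns true if the heartbeat should still be sent. */
bool heartbeat_should_continue(const struct worker_status *workers, size_t n,
                               int64_t now_ns, int64_t max_work_ns,
                               enum stop_policy policy)
{
    size_t stuck = 0;
    for (size_t i = 0; i < n; i++) {
        if (worker_is_stuck(&workers[i], now_ns, max_work_ns))
            stuck++;
    }
    if (policy == STOP_IF_ANY_STUCK)
        return stuck == 0;
    return n == 0 || stuck < n;   /* stop only when every worker is stuck */
}
```

Which policy fits depends on whether a single stuck worker already makes the node useless to the rest of the system.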

    (The activity marker value (and the work flag) will of course have to be protected by a mutex that must be acquired before reading or writing the value.)
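
One possible shape for the mutex protection, using POSIX threads. The struct and function names are assumptions made for illustration. For simplicity this sketch also refreshes the timestamp when the flag is cleared, which is harmless because an idle worker is not checked against the timeout.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared state: flag and timestamp guarded by one mutex. */
struct activity_marker {
    pthread_mutex_t lock;
    bool            busy;
    int64_t         last_activity_ns;
};

void marker_init(struct activity_marker *m)
{
    pthread_mutex_init(&m->lock, NULL);
    m->busy = false;
    m->last_activity_ns = 0;
}

/* Called by the worker thread when it starts, progresses, or finishes work. */
void marker_update(struct activity_marker *m, bool busy, int64_t now_ns)
{
    pthread_mutex_lock(&m->lock);
    m->busy = busy;
    m->last_activity_ns = now_ns;
    pthread_mutex_unlock(&m->lock);
}

/* Called by the heartbeat thread; both fields are copied under the lock
 * so the flag and the timestamp always belong to the same update. */
void marker_snapshot(struct activity_marker *m, bool *busy, int64_t *last_activity_ns)
{
    pthread_mutex_lock(&m->lock);
    *busy = m->busy;
    *last_activity_ns = m->last_activity_ns;
    pthread_mutex_unlock(&m->lock);
}
```

Neither function holds the lock across a system call, so a worker stuck in the kernel cannot block the heartbeat thread on the mutex.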

    Perhaps the heartbeat thread can also make the whole process terminate itself (e.g. kill(getpid(), SIGQUIT)) so that a wrapper script that runs it in a loop can restart it, especially if a process restart clears the condition in the kernel that caused the problem in the first place.
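
The answer suggests a wrapper script for the restart loop. As a rough equivalent in C, the sketch below swaps the script for a small parent process that forks the service and reports whether it was killed by a signal. The function names are hypothetical, and the service callback stands in for the real process body.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* What the heartbeat thread could call once it decides the process is stuck. */
void terminate_self(void)
{
    kill(getpid(), SIGQUIT);
}

/* Hypothetical supervisor step: runs service() in a child process and
 * returns true if the child was terminated by a signal, meaning the
 * caller should start it again. Returns false on clean exit or on error. */
bool run_once_needs_restart(void (*service)(void))
{
    pid_t pid = fork();
    if (pid < 0)
        return false;          /* fork failed; nothing was started */
    if (pid == 0) {
        service();
        _exit(0);              /* service returned normally */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return false;
    return WIFSIGNALED(status) != 0;
}
```

A supervisor would call run_once_needs_restart() in a loop for as long as it returns true, ideally with a short delay between restarts. Note that the default action for SIGQUIT also produces a core dump where core dumps are enabled.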
