High availability computing: How to deal with a non-returning system call, without risking false positives?

Question

I have a process that's running on a Linux computer as part of a high-availability system. The process has a main thread that receives requests from the other computers on the network and responds to them. There is also a heartbeat thread that sends out multicast heartbeat packets periodically, to let the other processes on the network know that this process is still alive and available. If they don't hear any heartbeat packets from it for a while, one of them will assume this process has died and will take over its duties, so that the system as a whole can continue to work.

This all works pretty well, but the other day the entire system failed, and when I investigated why I found the following:

Due to (what is apparently) a bug in the box's Linux kernel, there was a kernel "oops" induced by a system call that this process's main thread made.
Because of the kernel "oops", the system call never returned, leaving the process's main thread permanently hung.
The heartbeat thread, OTOH, continued to operate correctly, which meant that the other nodes on the network never realized that this node had failed, and none of them stepped in to take over its duties... and so the requested tasks were not performed and the system's operation effectively halted.

My question is, is there an elegant solution that can handle this sort of failure? (Obviously one thing to do is fix the Linux kernel so it doesn't "oops", but given the complexity of the Linux kernel, it would be nice if my software could handle other future kernel bugs more gracefully as well.)

One solution I don't like would be to put the heartbeat generator into the main thread, rather than running it as a separate thread, or in some other way tie it to the main thread so that if the main thread gets hung up indefinitely, heartbeats won't get sent. The reason I don't like this solution is because the main thread is not a real-time thread, and so doing this would introduce the possibility of occasional false-positives where a slow-to-complete operation was mistaken for a node failure. I'd like to avoid false positives if I can.

Ideally there would be some way to ensure that a failed syscall either returns an error code, or if that's not possible, crashes my process; either of those would halt the generation of heartbeat packets and allow a failover to proceed. Is there any way to do that, or does an unreliable kernel doom my user process to unreliability as well?

Answer 1:

My second suggestion is to use ptrace to find the current instruction pointer. You can have a parent thread that ptraces your process and interrupts it every second to check the current RIP value. This is somewhat complex, so I've written a demonstration program: (x86_64 only, but that should be fixable by changing the register names.)

#define _GNU_SOURCE
#include <unistd.h>
#include <sched.h>
#include <signal.h>  // SIGTRAP
#include <stdlib.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <linux/ptrace.h>
#include <sys/user.h>
#include <time.h>

// this number is arbitrary - find a better one.
#define STACK_SIZE (1024 * 1024)

int main_thread(void *ptr) {
    // "main" thread is now running under the monitor
    printf("Hello from main!\n");
    while (1) {
        int c = getchar();
        if (c == EOF) { break; }
        nanosleep(&(struct timespec) {0, 200 * 1000 * 1000}, NULL);
        putchar(c);
    }
    return 0;
}

int main(int argc, char *argv[]) {
    void *vstack = malloc(STACK_SIZE);
    pid_t v;
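    // the stack grows down on x86_64, so pass the top of the allocation; cast to char * for portable pointer arithmetic
    if (clone(main_thread, (char *)vstack + STACK_SIZE, CLONE_PARENT_SETTID | CLONE_FILES | CLONE_FS | CLONE_IO, NULL, &v) == -1) { // you'll want to check these flags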
        perror("failed to spawn child task");
        return 3;
    }
    printf("Target: %d; %d\n", v, getpid());
    long ptv = ptrace(PTRACE_SEIZE, v, NULL, NULL);
    if (ptv == -1) {
        perror("failed monitor seize");
        exit(1);
    }
    struct user_regs_struct regs;
    fprintf(stderr, "beginning monitor...\n");
    while (1) {
        sleep(1);
        long ptv = ptrace(PTRACE_INTERRUPT, v, NULL, NULL);
        if (ptv == -1) {
            perror("failed to interrupt main thread");
            break;
        }
        int status;
        if (waitpid(v, &status, __WCLONE) == -1) {
            perror("target wait failed");
            break;
        }
        if (!WIFSTOPPED(status)) { // this section is messy. do it better.
            fputs("target wait went wrong\n", stderr);
            break;
        }
        if ((status >> 8) != (SIGTRAP | PTRACE_EVENT_STOP << 8)) {
            fputs("target wait went wrong (2)\n", stderr);
            break;
        }
        ptv = ptrace(PTRACE_GETREGS, v, NULL, &regs);
        if (ptv == -1) {
            perror("failed to peek at registers of thread");
            break;
        }
        // rip and rsp are unsigned long long on x86_64, so print them with %llx
        fprintf(stderr, "%ld -> RIP %llx RSP %llx\n", (long)time(NULL), regs.rip, regs.rsp);
        ptv = ptrace(PTRACE_CONT, v, NULL, NULL);
        if (ptv == -1) {
            perror("failed to resume main thread");
            break;
        }
    }
    return 2;
}

Note that this is not production-quality code. You'll need to do a bunch of fixing things up.

Based on this, you should be able to figure out whether or not the program counter is advancing, and could combine this with other pieces of information (such as /proc/PID/status) to find if it's busy in a system call. You might also be able to extend the usage of ptrace to check what system calls are being used, so that you can check if it's a reasonable one to be waiting on.

This is a hacky solution, but I don't think you'll find a non-hacky one for this problem. Despite the hackiness, I don't think (this is untested) that it would be particularly slow: my implementation pauses the monitored thread once per second for a very short time, which I would guess is in the hundreds of microseconds. At 100 microseconds per second, that is about a 0.01% efficiency loss, theoretically.
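
As a rough illustration of the /proc/PID/status idea mentioned above, here is a minimal sketch of reading a task's scheduler state. The function name read_task_state is made up for this example; the only thing it relies on is the "State:" line that Linux puts in /proc/<pid>/status. It assumes the monitored task has its own /proc entry, which is the case for a task created with clone() without CLONE_THREAD, as in the demonstration program.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: return the one-letter state from the "State:" line
 * of /proc/<pid>/status (for example 'R' running, 'S' interruptible sleep,
 * 'D' uninterruptible sleep), or 0 if the file cannot be read or parsed. */
char read_task_state(pid_t pid) {
    char path[64];
    char line[256];
    char state = 0;

    snprintf(path, sizeof path, "/proc/%ld/status", (long)pid);
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return 0;
    }
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "State:", 6) == 0) {
            /* The line looks like "State:\tS (sleeping)". */
            if (sscanf(line + 6, " %c", &state) != 1) {
                state = 0;
            }
            break;
        }
    }
    fclose(f);
    return state;
}
```

The monitor could call this with the target's pid on each pass. A task that stays in 'D' across many samples is a candidate for "stuck in a system call", though which states and how many samples count as a failure is a judgment call for your system.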

Answer 2:

I think you need a shared activity marker.

Have the main thread (or in a more general application, all worker threads) update the shared activity marker with the current time (or clock tick, e.g. by computing the "current" nanosecond from clock_gettime(CLOCK_MONOTONIC, ...)), and have the heartbeat thread periodically check when this activity marker was last updated, cancelling itself (and thus stopping the heartbeat broadcast) if there has not been any activity update within a reasonable time.

This scheme can easily be extended with a state flag if the workload is very sporadic. The main work thread sets the flag and updates the activity marker when it begins a unit of work, and clears the flag when the work has completed. If there is no work being done then the heartbeat is sent without checking the activity marker. If work is being done then the heartbeat is stopped if the time since the activity marker was updated exceeds the maximum processing time allowed for a unit of work. (Multiple worker threads each need their own activity marker and flag in this case, and the heartbeat thread can be designed to stop when any one worker thread gets stuck, or only when all worker threads get stuck, depending on their purposes and importance to the overall system).

(The activity marker value (and the work flag) will of course have to be protected by a mutex that must be acquired before reading or writing the value.)

Perhaps the heartbeat thread can also cause the whole process to commit suicide (e.g. kill(getpid(), SIGQUIT)) so that it can be restarted by having it be called in a loop in a wrapper script, especially if a process restart clears the condition in the kernel which would cause the problem in the first place.

Answer 3:

One possible method would be to have another set of heartbeat messages from the main thread to the heartbeat thread. If it stops receiving messages for a certain amount of time, it stops sending them out as well. (And could try other recovery such as restarting the process.)

To solve the issue of the main thread actually just being in a long sleep, have a (properly-synchronized) flag that the heartbeat thread sets when it has decided that the main thread must have failed - and the main thread should check this flag at appropriate times (e.g. after the potential wait) to make sure that it hasn't been reported as dead. If it has, it stops running, because its job would have already been taken up by a different node.

The main thread can also send I-am-alive events to the heartbeat thread at other times than once around the loop - for example, if it's going into a long-running operation. Without this, there's no way to tell the difference between a failed main thread and a sleeping main thread.

Source: https://stackoverflow.com/questions/30061396/high-availability-computing-how-to-deal-with-a-non-returning-system-call-witho

Tags

Linux

high-availability

failover

heartbeat