Compiling an application for use in highly radioactive environments

名媛妹妹 2020-12-02 03:30

We are compiling an embedded C++ application that is deployed in a shielded device in an environment bombarded with ionizing radiation. We are using GCC and cross-compiling for ARM.

23 answers
  • 2020-12-02 03:45

    You may also be interested in the rich literature on the subject of algorithmic fault tolerance. This includes the old assignment: Write a sort that correctly sorts its input when a constant number of comparisons will fail (or, the slightly more evil version, when the asymptotic number of failed comparisons scales as log(n) for n comparisons).

    A place to start reading is Huang and Abraham's 1984 paper "Algorithm-Based Fault Tolerance for Matrix Operations". Their idea is vaguely similar to homomorphically encrypted computation (but it is not really the same, since they are attempting error detection/correction at the operation level).
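
    To make the scheme concrete, here is a toy sketch of their row/column-checksum idea for a 2x2 matrix multiply (my own illustration, not code from the paper; names and sizes are arbitrary):

        #include <cstdio>

        // Toy sketch of the Huang & Abraham checksum scheme for C = A*B.
        // A is augmented with a column-checksum row, B with a row-checksum
        // column; the product of the augmented matrices then carries both
        // checksums, which we verify after the multiply.
        constexpr int N = 2;

        int main() {
            long a[N][N] = {{1, 2}, {3, 4}};
            long b[N][N] = {{5, 6}, {7, 8}};

            long ac[N + 1][N] = {};      // A plus a column-checksum row
            long br[N][N + 1] = {};      // B plus a row-checksum column
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j) {
                    ac[i][j] = a[i][j];
                    ac[N][j] += a[i][j];
                    br[i][j] = b[i][j];
                    br[i][N] += b[i][j];
                }

            long c[N + 1][N + 1] = {};   // full-checksum product
            for (int i = 0; i <= N; ++i)
                for (int j = 0; j <= N; ++j)
                    for (int k = 0; k < N; ++k)
                        c[i][j] += ac[i][k] * br[k][j];

            // Row N must hold the column sums of C, and column N the row
            // sums. A single corrupted element is located by the (row,
            // column) pair whose checksums both fail, and can be corrected
            // from either checksum.
            for (int j = 0; j < N; ++j) {
                long colsum = 0;
                for (int i = 0; i < N; ++i) colsum += c[i][j];
                if (colsum != c[N][j]) std::printf("column %d corrupted\n", j);
            }
            for (int i = 0; i < N; ++i) {
                long rowsum = 0;
                for (int j = 0; j < N; ++j) rowsum += c[i][j];
                if (rowsum != c[i][N]) std::printf("row %d corrupted\n", i);
            }
            return 0;
        }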

    A more recent descendant of that paper is Bosilca, Delmas, Dongarra, and Langou's "Algorithm-based fault tolerance applied to high performance computing".

  • 2020-12-02 03:46

    Since you specifically ask for software solutions, and you are using C++, why not use operator overloading to make your own safe datatypes? For example:

    Instead of using uint32_t (and double, int64_t, etc.), make your own SAFE_uint32_t which contains a multiple (minimum of 3) of uint32_t. Overload all of the operations you want to perform (* + - / << >> = == != etc.), and make the overloaded operations act independently on each internal value, i.e. don't do the operation once and copy the result. Both before and after, check that all of the internal values match. If they don't, you can update the wrong one to the most common value. If there is no most common value, you can safely signal that an error has occurred.
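
    A minimal sketch of such a type (the class name, the choice of abort() as the error hook, and the reduced operator set are mine):

        #include <cstdint>
        #include <cstdlib>

        // Triple-redundant uint32_t with majority voting. Only two
        // operators are shown; a real version would overload the full set
        // listed above, and would also have to stop the compiler from
        // collapsing the three copies into one computation (e.g. with
        // volatile accesses).
        class SAFE_uint32_t {
            uint32_t v[3];

            uint32_t vote() const {
                if (v[0] == v[1] || v[0] == v[2]) return v[0];
                if (v[1] == v[2]) return v[1];
                std::abort();                // no majority: unrecoverable
            }
            void heal() { v[0] = v[1] = v[2] = vote(); } // repair all copies

        public:
            SAFE_uint32_t(uint32_t x = 0) { v[0] = v[1] = v[2] = x; }

            // Do the operation independently on each copy, then vote and
            // repair -- never compute once and copy the result.
            SAFE_uint32_t& operator+=(const SAFE_uint32_t& o) {
                v[0] += o.v[0]; v[1] += o.v[1]; v[2] += o.v[2];
                heal();
                return *this;
            }
            friend SAFE_uint32_t operator+(SAFE_uint32_t a, const SAFE_uint32_t& b) {
                return a += b;
            }
            bool operator==(const SAFE_uint32_t& o) const { return vote() == o.vote(); }
            uint32_t value() const { return vote(); }
        };

        int main() {
            SAFE_uint32_t a(40), b(2);
            return (a + b).value() == 42 ? 0 : 1;
        }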

    This way it doesn't matter if corruption occurs in the ALU, registers, RAM, or on a bus; you will still have multiple attempts and a very good chance of catching errors. Note, however, that this only works for the variables you can replace; your stack pointer, for example, will still be susceptible.

    A side story: I ran into a similar issue, also on an old ARM chip. It turned out to be a toolchain issue: an old version of GCC that, together with the specific chip we used, triggered a bug in certain edge cases that would (sometimes) corrupt values being passed into functions. Make sure your device doesn't have any problems before blaming it on radioactivity, and yes, sometimes it is a compiler bug =)

  • 2020-12-02 03:46

    One point no-one seems to have mentioned. You say you're developing in GCC and cross-compiling onto ARM. How do you know that you don't have code which makes assumptions about free RAM, integer size, pointer size, how long it takes to do a certain operation, how long the system will run for continuously, or various stuff like that? This is a very common problem.

    The answer is usually automated unit testing. Write test harnesses which exercise the code on the development system, then run the same test harnesses on the target system. Look for differences!
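
    For the platform-assumption half of this, some checks don't even need a test run; here is a sketch of compile-time assertions that fail loudly under whichever compiler violates them (the specific assumptions are examples; assert whatever your code actually relies on):

        #include <climits>
        #include <cstdint>

        // Compile the same file with the host compiler and the ARM
        // cross-compiler and let either one fail loudly.
        static_assert(sizeof(int) == 4, "code assumes 32-bit int");
        static_assert(sizeof(void*) == 4, "code assumes 32-bit pointers");
        static_assert(CHAR_BIT == 8, "code assumes 8-bit bytes");
        // This one really does differ between platforms: plain char is
        // typically unsigned on ARM but signed on x86.
        static_assert((char)-1 < 0, "code assumes plain char is signed");

        int main() { return 0; }

    Runtime assumptions (timing, free RAM, continuous uptime) still need the test harness approach above.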

    Also check for errata on your embedded device. You may find there's something about "don't do this because it'll crash, so enable that compiler option and the compiler will work around it".

    In short, your most likely source of crashes is bugs in your code. Until you've made pretty damn sure this isn't the case, don't worry (yet) about more esoteric failure modes.

  • 2020-12-02 03:46

    Someone mentioned using slower chips to prevent ions from flipping bits as easily. In a similar fashion, perhaps use a specialized CPU/RAM that actually uses multiple bits to store a single bit. This provides hardware fault tolerance, because it is very unlikely that all of the bits would get flipped. So 1 = 1111, but it would need to be hit four times to actually be flipped (four might be a bad number, since if two bits get flipped it's already ambiguous). If you go with 8, you get one eighth the RAM and somewhat slower access time, but a much more reliable data representation. You could probably do this at the software level with a specialized compiler (allocate x times more space for everything) or a language implementation (write wrappers for data structures that allocate things this way), or with specialized hardware that has the same logical structure but does this in the firmware.
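
    In software, the per-bit replication could look something like this (an illustrative sketch; it uses three copies per bit rather than four or eight, so a majority vote is always decisive):

        #include <cstdint>

        // Each bit of a byte is stored three times, turning one uint8_t
        // into 24 bits of a uint32_t. The encode/decode cost and the 3x
        // memory are the price of the redundancy described above.
        uint32_t encode(uint8_t x) {
            uint32_t e = 0;
            for (int i = 0; i < 8; ++i) {
                uint32_t bit = (x >> i) & 1u;
                e |= (bit | bit << 1 | bit << 2) << (3 * i); // 3 copies of bit i
            }
            return e;
        }

        uint8_t decode(uint32_t e) {        // majority vote per bit triple
            uint8_t x = 0;
            for (int i = 0; i < 8; ++i) {
                unsigned t = (e >> (3 * i)) & 7u;
                unsigned ones = (t & 1u) + ((t >> 1) & 1u) + ((t >> 2) & 1u);
                if (ones >= 2) x |= uint8_t(1u << i);
            }
            return x;
        }

        int main() {
            uint32_t e = encode(0xA5);
            e ^= 1u << 7;                     // simulate one flipped bit
            return decode(e) == 0xA5 ? 0 : 1; // still decodes correctly
        }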

  • 2020-12-02 03:49

    Firstly, design your application around failure. Ensure that, as part of its normal flow of operation, it expects to reset (depending on your application and the type of failure, either a soft or a hard reset). This is hard to get perfect: critical operations that require some degree of transactionality may need to be checked and tweaked at the assembly level so that an interruption at a key point cannot result in inconsistent external commands. Fail fast as soon as any unrecoverable memory corruption or control-flow deviation is detected. Log failures if possible.

    Secondly, where possible, correct corruption and continue. This means checksumming and fixing constant tables (and program code, if you can) often, perhaps before each major operation or on a timed interrupt, and storing variables in structures that autocorrect (again, before each major operation or on a timed interrupt, take a majority vote from 3 copies and correct any single deviation). Log corrections if possible.
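
    As a concrete sketch of the checksum-and-fix part (the additive checksum and the two-copy repair policy are my own simplifications; a CRC would be a better fit in practice):

        #include <cstdint>
        #include <cstdlib>
        #include <cstring>

        // Scrub a constant table on a timer: two copies of the data plus a
        // stored golden checksum; whichever copy still matches the checksum
        // is used to repair the other.
        static uint8_t table_a[16] = {1, 2, 3, 4, 5, 6, 7, 8,
                                      9, 10, 11, 12, 13, 14, 15, 16};
        static uint8_t table_b[16] = {1, 2, 3, 4, 5, 6, 7, 8,
                                      9, 10, 11, 12, 13, 14, 15, 16};
        static const uint32_t table_sum = 136;  // precomputed golden checksum

        static uint32_t checksum(const uint8_t* p, size_t n) {
            uint32_t s = 0;
            while (n--) s += *p++;
            return s;
        }

        // Call before each major operation or from a timed interrupt.
        void scrub() {
            bool a_ok = checksum(table_a, sizeof table_a) == table_sum;
            bool b_ok = checksum(table_b, sizeof table_b) == table_sum;
            if (a_ok && !b_ok) std::memcpy(table_b, table_a, sizeof table_b);
            else if (b_ok && !a_ok) std::memcpy(table_a, table_b, sizeof table_a);
            else if (!a_ok && !b_ok) std::abort(); // both corrupt: fail fast
            // log the correction here if possible
        }

        int main() {
            table_a[3] ^= 0x40;  // simulate a flipped bit in the primary copy
            scrub();             // repaired from the intact backup
            return table_a[3] == 4 ? 0 : 1;
        }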

    Thirdly, test failure. Set up a repeatable test environment that flips bits in memory pseudo-randomly. This will allow you to replicate corruption situations and help you design your application around them.
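
    A minimal sketch of such a bit-flipper (the region, rate, and seed are placeholders; on the real target this would run from a timer interrupt over the application's data sections):

        #include <cstdint>
        #include <cstdio>
        #include <random>

        // Repeatable fault injector: flip pseudo-random bits in a
        // designated region, with a fixed seed so every corruption run is
        // replayable.
        static uint8_t target_region[4096];

        int main() {
            std::mt19937 rng(12345);  // fixed seed => reproducible faults
            std::uniform_int_distribution<size_t>
                byte_pick(0, sizeof target_region - 1);
            std::uniform_int_distribution<int> bit_pick(0, 7);

            for (int i = 0; i < 100; ++i) {   // inject 100 single-bit upsets
                size_t b = byte_pick(rng);
                int bit = bit_pick(rng);
                target_region[b] ^= uint8_t(1u << bit);
                std::printf("flip byte %zu bit %d\n", b, bit);
                // ... run a slice of the application under test here ...
            }
            return 0;
        }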

  • 2020-12-02 03:50

    Given supercat's comments, the tendencies of modern compilers, and other things, I'd be tempted to go back to the ancient days and write the whole thing in assembly, with static memory allocation everywhere. For this kind of utter reliability, I think assembly no longer incurs a large percentage difference in cost.
