Compiling an application for use in highly radioactive environments

后端 未结 23 1615
名媛妹妹
名媛妹妹 2020-12-02 03:30

We are compiling an embedded C++ application that is deployed in a shielded device in an environment bombarded with ionizing radiation. We are using GCC and cross-compiling

相关标签:
23条回答
  • 2020-12-02 03:40

    Here are some thoughts and ideas:

    Use ROM more creatively.

    Store anything you can in ROM. Instead of calculating things, store look-up tables in ROM. (Make sure your compiler is outputting your look-up tables to the read-only section! Print out memory addresses at runtime to check!) Store your interrupt vector table in ROM. Of course, run some tests to see how reliable your ROM is compared to your RAM.

    Use your best RAM for the stack.

    SEUs in the stack are probably the most likely source of crashes, because it is where things like index variables, status variables, return addresses, and pointers of various sorts typically live.

    Implement timer-tick and watchdog timer routines.

    You can run a "sanity check" routine every timer tick, as well as a watchdog routine to handle the system locking up. Your main code could also periodically increment a counter to indicate progress, and the sanity-check routine could ensure this has occurred.

    Implement error-correcting-codes in software.

    You can add redundancy to your data to be able to detect and/or correct errors. This will add processing time, potentially leaving the processor exposed to radiation for a longer time, thus increasing the chance of errors, so you must consider the trade-off.

    Remember the caches.

    Check the sizes of your CPU caches. Data that you have accessed or modified recently will probably be within a cache. I believe you can disable at least some of the caches (at a big performance cost); you should try this to see how susceptible the caches are to SEUs. If the caches are hardier than RAM then you could regularly read and re-write critical data to make sure it stays in cache and bring RAM back into line.

    Use page-fault handlers cleverly.

    If you mark a memory page as not-present, the CPU will issue a page fault when you try to access it. You can create a page-fault handler that does some checking before servicing the read request. (PC operating systems use this to transparently load pages that have been swapped to disk.)

    Use assembly language for critical things (which could be everything).

    With assembly language, you know what is in registers and what is in RAM; you know what special RAM tables the CPU is using, and you can design things in a roundabout way to keep your risk down.

    Use objdump to actually look at the generated assembly language, and work out how much code each of your routines takes up.

    If you are using a big OS like Linux then you are asking for trouble; there is just so much complexity and so many things to go wrong.

    Remember it is a game of probabilities.

    A commenter said

    Every routine you write to catch errors will be subject to failing itself from the same cause.

    While this is true, the chances of errors in the (say) 100 bytes of code and data required for a check routine to function correctly is much smaller than the chance of errors elsewhere. If your ROM is pretty reliable and almost all the code/data is actually in ROM then your odds are even better.

    Use redundant hardware.

    Use 2 or more identical hardware setups with identical code. If the results differ, a reset should be triggered. With 3 or more devices you can use a "voting" system to try to identify which one has been compromised.

    0 讨论(0)
  • 2020-12-02 03:40

    This answer assumes you are concerned with having a system that works correctly, over and above having a system that is minimum cost or fast; most people playing with radioactive things value correctness / safety over speed / cost

    Several people have suggested hardware changes you can make (fine - there's lots of good stuff here in answers already and I don't intend repeating all of it), and others have suggested redundancy (great in principle), but I don't think anyone has suggested how that redundancy might work in practice. How do you fail over? How do you know when something has 'gone wrong'? Many technologies work on the basis everything will work, and failure is thus a tricky thing to deal with. However, some distributed computing technologies designed for scale expect failure (after all with enough scale, failure of one node of many is inevitable with any MTBF for a single node); you can harness this for your environment.

    Here are some ideas:

    • Ensure that your entire hardware is replicated n times (where n is greater than 2, and preferably odd), and that each hardware element can communicate with each other hardware element. Ethernet is one obvious way to do that, but there are many other far simpler routes that would give better protection (e.g. CAN). Minimise common components (even power supplies). This may mean sampling ADC inputs in multiple places for instance.

    • Ensure your application state is in a single place, e.g. in a finite state machine. This can be entirely RAM based, though does not preclude stable storage. It will thus be stored in several place.

    • Adopt a quorum protocol for changes of state. See RAFT for example. As you are working in C++, there are well known libraries for this. Changes to the FSM would only get made when a majority of nodes agree. Use a known good library for the protocol stack and the quorum protocol rather than rolling one yourself, or all your good work on redundancy will be wasted when the quorum protocol hangs up.

    • Ensure you checksum (e.g. CRC/SHA) your FSM, and store the CRC/SHA in the FSM itself (as well as transmitting in the message, and checksumming the messages themselves). Get the nodes to check their FSM regularly against these checksum, checksum incoming messages, and check their checksum matches the checksum of the quorum.

    • Build as many other internal checks into your system as possible, making nodes that detect their own failure reboot (this is better than carrying on half working provided you have enough nodes). Attempt to let them cleanly remove themselves from the quorum during rebooting in case they don't come up again. On reboot have them checksum the software image (and anything else they load) and do a full RAM test before reintroducing themselves to the quorum.

    • Use hardware to support you, but do so carefully. You can get ECC RAM, for instance, and regularly read/write through it to correct ECC errors (and panic if the error is uncorrectable). However (from memory) static RAM is far more tolerant of ionizing radiation than DRAM is in the first place, so it may be better to use static DRAM instead. See the first point under 'things I would not do' as well.

    Let's say you have an 1% chance of failure of any given node within one day, and let's pretend you can make failures entirely independent. With 5 nodes, you'll need three to fail within one day, which is a .00001% chance. With more, well, you get the idea.

    Things I would not do:

    • Underestimate the value of not having the problem to start off with. Unless weight is a concern, a large block of metal around your device is going to be a far cheaper and more reliable solution than a team of programmers can come up with. Ditto optical coupling of inputs of EMI is an issue, etc. Whatever, attempt when sourcing your components to source those rated best against ionizing radiation.

    • Roll your own algorithms. People have done this stuff before. Use their work. Fault tolerance and distributed algorithms are hard. Use other people's work where possible.

    • Use complicated compiler settings in the naive hope you detect more failures. If you are lucky, you may detect more failures. More likely, you will use a code-path within the compiler which has been less tested, particularly if you rolled it yourself.

    • Use techniques which are untested in your environment. Most people writing high availability software have to simulate failure modes to check their HA works correctly, and miss many failure modes as a result. You are in the 'fortunate' position of having frequent failures on demand. So test each technique, and ensure its application actual improves MTBF by an amount that exceeds the complexity to introduce it (with complexity comes bugs). Especially apply this to my advice re quorum algorithms etc.

    0 讨论(0)
  • 2020-12-02 03:41

    Use a cyclic scheduler. This gives you the ability to add regular maintenance times to check the correctness of critical data. The problem most often encountered is corruption of the stack. If your software is cyclical you can reinitialise the stack between cycles. Do not reuse the stacks for interrupt calls, setup a separate stack of each important interrupt call.

    Similar to the Watchdog concept is deadline timers. Start a hardware timer before calling a function. If the function does not return before the deadline timer interrupts then reload the stack and try again. If it still fails after 3/5 tries you need reload from ROM.

    Split your software into parts and isolate these parts to use separate memory areas and execution times (Especially in a control environment). Example: signal acquisition, prepossessing data, main algorithm and result implementation/transmission. This means a failure in one part will not cause failures through the rest of the program. So while we are repairing the signal acquisition the rest of tasks continues on stale data.

    Everything needs CRCs. If you execute out of RAM even your .text needs a CRC. Check the CRCs regularly if you using a cyclical scheduler. Some compilers (not GCC) can generate CRCs for each section and some processors have dedicated hardware to do CRC calculations, but I guess that would fall out side of the scope of your question. Checking CRCs also prompts the ECC controller on the memory to repair single bit errors before it becomes a problem.

    0 讨论(0)
  • 2020-12-02 03:42

    Working for about 4-5 years with software/firmware development and environment testing of miniaturized satellites*, I would like to share my experience here.

    *(miniaturized satellites are a lot more prone to single event upsets than bigger satellites due to its relatively small, limited sizes for its electronic components)

    To be very concise and direct: there is no mechanism to recover from detectable, erroneous situation by the software/firmware itself without, at least, one copy of minimum working version of the software/firmware somewhere for recovery purpose - and with the hardware supporting the recovery (functional).

    Now, this situation is normally handled both in the hardware and software level. Here, as you request, I will share what we can do in the software level.

    1. ...recovery purpose.... Provide ability to update/recompile/reflash your software/firmware in real environment. This is an almost must-have feature for any software/firmware in highly ionized environment. Without this, you could have redundant software/hardware as many as you want but at one point, they are all going to blow up. So, prepare this feature!

    2. ...minimum working version... Have responsive, multiple copies, minimum version of the software/firmware in your code. This is like Safe mode in Windows. Instead of having only one, fully functional version of your software, have multiple copies of the minimum version of your software/firmware. The minimum copy will usually having much less size than the full copy and almost always have only the following two or three features:

      1. capable of listening to command from external system,
      2. capable of updating the current software/firmware,
      3. capable of monitoring the basic operation's housekeeping data.
    3. ...copy... somewhere... Have redundant software/firmware somewhere.

      1. You could, with or without redundant hardware, try to have redundant software/firmware in your ARM uC. This is normally done by having two or more identical software/firmware in separate addresses which sending heartbeat to each other - but only one will be active at a time. If one or more software/firmware is known to be unresponsive, switch to the other software/firmware. The benefit of using this approach is we can have functional replacement immediately after an error occurs - without any contact with whatever external system/party who is responsible to detect and to repair the error (in satellite case, it is usually the Mission Control Centre (MCC)).

        Strictly speaking, without redundant hardware, the disadvantage of doing this is you actually cannot eliminate all single point of failures. At the very least, you will still have one single point of failure, which is the switch itself (or often the beginning of the code). Nevertheless, for a device limited by size in a highly ionized environment (such as pico/femto satellites), the reduction of the single point of failures to one point without additional hardware will still be worth considering. Somemore, the piece of code for the switching would certainly be much less than the code for the whole program - significantly reducing the risk of getting Single Event in it.

      2. But if you are not doing this, you should have at least one copy in your external system which can come in contact with the device and update the software/firmware (in the satellite case, it is again the mission control centre).

      3. You could also have the copy in your permanent memory storage in your device which can be triggered to restore the running system's software/firmware
    4. ...detectable erroneous situation.. The error must be detectable, usually by the hardware error correction/detection circuit or by a small piece of code for error correction/detection. It is best to put such code small, multiple, and independent from the main software/firmware. Its main task is only for checking/correcting. If the hardware circuit/firmware is reliable (such as it is more radiation hardened than the rests - or having multiple circuits/logics), then you might consider making error-correction with it. But if it is not, it is better to make it as error-detection. The correction can be by external system/device. For the error correction, you could consider making use of a basic error correction algorithm like Hamming/Golay23, because they can be implemented more easily both in the circuit/software. But it ultimately depends on your team's capability. For error detection, normally CRC is used.

    5. ...hardware supporting the recovery Now, comes to the most difficult aspect on this issue. Ultimately, the recovery requires the hardware which is responsible for the recovery to be at least functional. If the hardware is permanently broken (normally happen after its Total ionizing dose reaches certain level), then there is (sadly) no way for the software to help in recovery. Thus, hardware is rightly the utmost importance concern for a device exposed to high radiation level (such as satellite).

    In addition to the suggestion for above anticipating firmware's error due to single event upset, I would also like to suggest you to have:

    1. Error detection and/or error correction algorithm in the inter-subsystem communication protocol. This is another almost must have in order to avoid incomplete/wrong signals received from other system

    2. Filter in your ADC reading. Do not use the ADC reading directly. Filter it by median filter, mean filter, or any other filters - never trust single reading value. Sample more, not less - reasonably.

    0 讨论(0)
  • 2020-12-02 03:42

    This is an extremely broad subject. Basically, you can't really recover from memory corruption, but you can at least try to fail promptly. Here are a few techniques you could use:

    • checksum constant data. If you have any configuration data which stays constant for a long time (including hardware registers you have configured), compute its checksum on initialization and verify it periodically. When you see a mismatch, it's time to re-initialize or reset.

    • store variables with redundancy. If you have an important variable x, write its value in x1, x2 and x3 and read it as (x1 == x2) ? x2 : x3.

    • implement program flow monitoring. XOR a global flag with a unique value in important functions/branches called from the main loop. Running the program in a radiation-free environment with near-100% test coverage should give you the list of acceptable values of the flag at the end of the cycle. Reset if you see deviations.

    • monitor the stack pointer. In the beginning of the main loop, compare the stack pointer with its expected value. Reset on deviation.

    0 讨论(0)
  • 2020-12-02 03:43

    Writing code for radioactive environments is not really any different than writing code for any mission-critical application.

    In addition to what has already been mentioned, here are some miscellaneous tips:

    • Use everyday "bread & butter" safety measures that should be present on any semi-professional embedded system: internal watchdog, internal low-voltage detect, internal clock monitor. These things shouldn't even need to be mentioned in the year 2016 and they are standard on pretty much every modern microcontroller.

    • If you have a safety and/or automotive-oriented MCU, it will have certain watchdog features, such as a given time window, inside which you need to refresh the watchdog. This is preferred if you have a mission-critical real-time system.

    • In general, use a MCU suitable for these kind of systems, and not some generic mainstream fluff you received in a packet of corn flakes. Almost every MCU manufacturer nowadays have specialized MCUs designed for safety applications (TI, Freescale, Renesas, ST, Infineon etc etc). These have lots of built-in safety features, including lock-step cores: meaning that there are 2 CPU cores executing the same code, and they must agree with each other.

    • IMPORTANT: You must ensure the integrity of internal MCU registers. All control & status registers of hardware peripherals that are writeable may be located in RAM memory, and are therefore vulnerable.

      To protect yourself against register corruptions, preferably pick a microcontroller with built-in "write-once" features of registers. In addition, you need to store default values of all hardware registers in NVM and copy-down those values to your registers at regular intervals. You can ensure the integrity of important variables in the same manner.

      Note: always use defensive programming. Meaning that you have to setup all registers in the MCU and not just the ones used by the application. You don't want some random hardware peripheral to suddenly wake up.

    • There are all kinds of methods to check for errors in RAM or NVM: checksums, "walking patterns", software ECC etc etc. The best solution nowadays is to not use any of these, but to use a MCU with built-in ECC and similar checks. Because doing this in software is complex, and the error check in itself could therefore introduce bugs and unexpected problems.

    • Use redundancy. You could store both volatile and non-volatile memory in two identical "mirror" segments, that must always be equivalent. Each segment could have a CRC checksum attached.

    • Avoid using external memories outside the MCU.

    • Implement a default interrupt service routine / default exception handler for all possible interrupts/exceptions. Even the ones you are not using. The default routine should do nothing except shutting off its own interrupt source.

    • Understand and embrace the concept of defensive programming. This means that your program needs to handle all possible cases, even those that cannot occur in theory. Examples.

      High quality mission-critical firmware detects as many errors as possible, and then handles or ignores them in a safe manner.

    • Never write programs that rely on poorly-specified behavior. It is likely that such behavior might change drastically with unexpected hardware changes caused by radiation or EMI. The best way to ensure that your program is free from such crap is to use a coding standard like MISRA, together with a static analyser tool. This will also help with defensive programming and with weeding out bugs (why would you not want to detect bugs in any kind of application?).

    • IMPORTANT: Don't implement any reliance of the default values of static storage duration variables. That is, don't trust the default contents of the .data or .bss. There could be any amount of time between the point of initialization to the point where the variable is actually used, there could have been plenty of time for the RAM to get corrupted. Instead, write the program so that all such variables are set from NVM in run-time, just before the time when such a variable is used for the first time.

      In practice this means that if a variable is declared at file scope or as static, you should never use = to initialize it (or you could, but it is pointless, because you cannot rely on the value anyhow). Always set it in run-time, just before use. If it is possible to repeatedly update such variables from NVM, then do so.

      Similarly in C++, don't rely on constructors for static storage duration variables. Have the constructor(s) call a public "set-up" routine, which you can also call later on in run-time, straight from the caller application.

      If possible, remove the "copy-down" start-up code that initializes .data and .bss (and calls C++ constructors) entirely, so that you get linker errors if you write code relying on such. Many compilers have the option to skip this, usually called "minimal/fast start-up" or similar.

      This means that any external libraries have to be checked so that they don't contain any such reliance.

    • Implement and define a safe state for the program, to where you will revert in case of critical errors.

    • Implementing an error report/error log system is always helpful.

    0 讨论(0)
提交回复
热议问题