I am working with a multithreaded embedded application. Each thread is allocated stack sizes based on its functionality. Recently we found that one of the thread corrupted the s
I have done exactly as you have suggested on dsPIC using CMX-Tiny+, however in the stack check I also maintain a 'hide-tide mark' for each stack. Rather than checking the value at the top of the stack, I iterate from the top to find the first non-signature value, and if this is higher than previously, I store it in a static variable. This is done in a lowest priority task so that it is performed whenever nothing else is scheduled (essentially replacing the idle-loop; in your RTOS you may be able to hook the idle loop and do it there). This means that it is typically checked more often than your 10ms periodic check; in that time the whole scheduler could be screwed.
My methodology is then to oversize the stacks, exercise the code, then check the high-tide marks to determine the margin for each task (and the ISR stack - don't forget that!), and adjust the stacks accordingly if I need to recover the 'wasted' space from the oversize stacks (I don't bother if the space is otherwise not needed).
The advantage of this approach is you don't wait until the stack is broken to detect a potential problem; you monitor it as you develop and as changes are checked in. This is useful since if the corruption hits a TCB or return address, your scheduler may be so broken the check never kicks in after an overflow.
Some RTOSes have this functionality built in (embOS, vxWorks that I know of). OS's that make use of MMU hardware may fare better by placing the stack in a protected memory space so an overflow causes a data abort. That is the 'better way' you seek perhaps; ARM9 has an MMU, but OS's that support it well tend to be more expensive. QNX Neutrino perhaps?
If you don't want to do the high-tide checking manually, simply oversize the stacks by say 1K, and then in the stack-check task trap the condition when the margin drops below 1K. That way you are more likely to trap the error condition while the scheduler is still viable. Not fool proof, but if you start allocating objects large enough the blow the stack in one go, alarm bells should ring in your head in any case - its the more common slow stack creep caused by ever deeper function nesting and the like that this will help with.
Clifford.
Checkout these similar questions: handling stack overflows in embedded systems and how can I visualise the memory sram usage of an avr program.
Personally I would use the Memory Management Unit of your Processor it it has one. It can do memory checking for you with minimal software overhead.
Set up a memory area in the MMU that will be used for the stack. It should be bordered by two memory areas where the MMU does not allow access. When your application is running you will receive a exception/interrupt as soon as you overflow the stack.
Because you get a exception at the moment the error occur you know exactly where in your application the stack went bad. You can look at the call stack to see exactly how you got to where you are. This makes it a lot easier to find your problem than trying to figure out what is wrong by detecting your problem long after it happened.
A MMU can also detect zero pointer accesses if you disallow memory access to the bottom part of your ram.
If you have the source of the RTOS you can build MMU protection of the stack and heap into it.