I am working with a multithreaded embedded application. Each thread is allocated a stack size based on its functionality. Recently we found that one of the threads corrupted the stack.
I have done exactly as you have suggested on a dsPIC using CMX-Tiny+; however, in the stack check I also maintain a 'high-tide mark' for each stack. Rather than checking the value at the top of the stack, I iterate from the top to find the first non-signature value, and if this is higher than the previous mark, I store it in a static variable. This is done in a lowest-priority task so that it is performed whenever nothing else is scheduled (essentially replacing the idle loop; in your RTOS you may be able to hook the idle loop and do it there). This means it is typically checked more often than your 10ms periodic check; in that time the whole scheduler could be screwed.
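To make the idea concrete, here is a minimal sketch of that scan (not my actual code): it assumes a full-descending stack that was pre-filled with a signature word at creation, with the lowest address of the stack area in `base`; the type and names (`stack_info_t`, `STACK_FILL`, `stack_update_high_tide`) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5A5u        /* signature the stack area was pre-filled with */

typedef struct
{
    uint16_t *base;               /* lowest address of the stack area   */
    size_t    size_words;         /* stack size in words                */
    size_t    high_tide;          /* deepest usage seen so far, in words */
} stack_info_t;

/* Scan from the unused end toward the in-use end and record the deepest
 * point the stack has ever reached. Call this from the idle/lowest-priority task. */
void stack_update_high_tide( stack_info_t *s )
{
    size_t unused = 0;

    /* Count how many signature words remain untouched at the far end. */
    while( unused < s->size_words && s->base[unused] == STACK_FILL )
    {
        unused++;
    }

    size_t used = s->size_words - unused;

    if( used > s->high_tide )
    {
        s->high_tide = used;      /* new high-tide mark */
    }
}
```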
My methodology is then to oversize the stacks, exercise the code, then check the high-tide marks to determine the margin for each task (and the ISR stack - don't forget that!), and adjust the stacks accordingly if I need to recover the 'wasted' space from the oversized stacks (I don't bother if the space is not otherwise needed).
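Reading the marks back during development can be as simple as watching a static table in the debugger, or dumping a report like the following sketch, which reuses the hypothetical stack_info_t from the snippet above (the function name and output format are just illustrative):

```c
#include <stdio.h>

/* Development-time report: how much of each task's stack has actually been used. */
void stack_report_margins( const stack_info_t *tasks, size_t n_tasks )
{
    for( size_t i = 0; i < n_tasks; i++ )
    {
        size_t margin = tasks[i].size_words - tasks[i].high_tide;

        printf( "task %u: used %u of %u words, margin %u\n",
                (unsigned)i,
                (unsigned)tasks[i].high_tide,
                (unsigned)tasks[i].size_words,
                (unsigned)margin );
    }
    /* Don't forget to include the ISR stack in this table as well. */
}
```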
The advantage of this approach is that you don't wait until the stack is broken to detect a potential problem; you monitor it as you develop and as changes are checked in. This matters because if the corruption hits a TCB or a return address, your scheduler may be so broken that the check never kicks in after an overflow.
Some RTOSes have this functionality built in (embOS and VxWorks, that I know of). OSes that make use of MMU hardware may fare better by placing the stack in a protected memory space so that an overflow causes a data abort. That is perhaps the 'better way' you seek; ARM9 has an MMU, but OSes that support it well tend to be more expensive. QNX Neutrino perhaps?
If you don't want to do the high-tide checking manually, simply oversize the stacks by, say, 1K, and then in the stack-check task trap the condition when the margin drops below 1K. That way you are more likely to trap the error condition while the scheduler is still viable. Not foolproof, but if you start allocating objects large enough to blow the stack in one go, alarm bells should ring in your head in any case - it's the more common slow stack creep, caused by ever-deeper function nesting and the like, that this will help with.
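A rough sketch of that trap, again reusing the hypothetical stack_info_t from the first snippet; the reserve size and the error hook stack_margin_alarm() are stand-ins for whatever your system provides:

```c
#define STACK_RESERVE_WORDS 512u   /* ~1K of deliberate oversize, in 16-bit words */

extern void stack_margin_alarm( size_t task_index, size_t margin_words ); /* hypothetical hook */

/* Periodic stack-check task: alarm as soon as any task eats into its reserve. */
void stack_check_margins( const stack_info_t *tasks, size_t n_tasks )
{
    for( size_t i = 0; i < n_tasks; i++ )
    {
        size_t margin = tasks[i].size_words - tasks[i].high_tide;

        if( margin < STACK_RESERVE_WORDS )
        {
            /* Still inside the oversized region, but shrinking: raise the alarm
             * while the scheduler is (probably) still viable. */
            stack_margin_alarm( i, margin );
        }
    }
}
```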
Clifford.