I am working with a multithreaded embedded application. Each thread is allocated stack sizes based on its functionality. Recently we found that one of the thread corrupted the s
ARM9 has JTAG/ETM debugging support on-die; you should be able to set up a data access watchpoint covering e.g. 64 bytes near the top of your stacks, which would then trigger a data abort, which you could catch in your program or externally.
(The hardware I work with only supports 2 read/write watchpoints, not sure if that's a limitation of the on-chip stuff or the surrounding third-party debug kit.)
This document, which is an extremely low-level description of how to interface with the JTAG functionality, suggests you read your processor's Technical Reference Manual -- and I can vouch that there's a decent amount of higher-level info in chapter 9 ("Debug Support") for the ARM946E-S r1p1 TRM.
Before you dig into understanding all this stuff (unless you're just doing it for fun/education), double-check that the hardware and software you're using won't already manage breakpoints/watchpoints for you. The concept of "watchpoint" was a bit hard to find in the debugging software we use -- it was a tab labelled "Hardware" in the add breakpoint dialog.
Another alternative: your compiler may support a command-line option to add function calls at the entry and exit points of functions (some sort of "void enterFunc(const char * callingFunc)" and "void exitFunc(const char * callingFunc)"), for function cost profiling, more accurate stack tracing, or similar. You can then write these functions to check your stack canary value.
(As an aside, in our case we actually ignore the function name that is passed in (I wish I could get the linker to strip these) and just use the processor's link register (LR) value to record where we came from. We use this for getting accurate call traces as well as profiling information; checking the stack canaries at this point would be trivial too!)
The problem is, of course, that calling these functions changes the register and stack profiles for the functions a bit... Not much, in our experiments, but a bit. The performance implications are worse, and wherever there's a performance implication there's the chance of a behavior change in the program, which may mean you e.g. avoid triggering a deep-recursion case that you might have before...
Very late update: these days, if you have a clang+LLVM based pipeline, you may be able to use Address Sanitizer (ASAN) to catch some of these. Be on the lookout for similar features in your compiler! (It's worth knowing about UBSAN and the other sanitizers too.)