I am trying to improve the performance of a threaded application with real-time deadlines. It runs on Windows Mobile and is written in C/C++. I have a suspicion that
A context switch is very expensive, not because of the CPU operation itself but because of what it does to the caches. An intensive task fills the CPU's instruction and data caches, and the TLB and the memory prefetcher also adapt to the areas of RAM it is working on.
When you switch context, the new thread finds little of its own state in any of these mechanisms, so it effectively starts from a "blank" state.
The accepted answer is wrong unless your threads are just incrementing a counter; in that case, of course, no cache flush is involved. There is no point in benchmarking context switches without filling the cache the way real applications do.
Context switches are expensive: as a rule of thumb, one costs about 30µs of CPU overhead (http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html).
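To make that concrete, here is a rough sketch of a benchmark in which each thread walks its own buffer between yields, so the cache actually holds something worth losing. It is standard C++11 rather than anything Windows Mobile specific. The names `touch_working_set` and `seconds_per_pass` are made up for this example, and the 64-byte cache line is an assumption. On a multi-core machine the two threads may simply run in parallel, so pin them to one core if you want every yield to be a real switch.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Touch one byte per (assumed) 64-byte cache line so the whole buffer
// is pulled through the data cache. Returns a checksum so the loop
// cannot be optimized away.
unsigned long touch_working_set(std::vector<unsigned char>& buf) {
    unsigned long sum = 0;
    for (std::size_t i = 0; i < buf.size(); i += 64) {
        buf[i] = static_cast<unsigned char>(buf[i] + 1);
        sum += buf[i];
    }
    return sum;
}

// Two threads each walk a private buffer `passes` times, optionally
// yielding after every pass. Returns wall-clock seconds divided by
// `passes`. The time includes thread start-up and buffer allocation,
// so use enough passes to make that negligible.
double seconds_per_pass(std::size_t working_set_bytes, int passes,
                        bool yield_between_passes) {
    assert(working_set_bytes > 0 && passes > 0);
    std::atomic<unsigned long> sink{0};

    auto worker = [&] {
        std::vector<unsigned char> buf(working_set_bytes, 1);
        unsigned long local = 0;
        for (int p = 0; p < passes; ++p) {
            local += touch_working_set(buf);
            if (yield_between_passes) {
                std::this_thread::yield();  // give the other thread a turn
            }
        }
        sink.fetch_add(local);  // keep the result observable
    };

    auto start = std::chrono::steady_clock::now();
    std::thread a(worker);
    std::thread b(worker);
    a.join();
    b.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / passes;
}
```

Call it twice with the same buffer size, once with `yield_between_passes` set to `true` and once with `false`, and compare the results. Repeating that with a buffer far smaller than the cache and one around the cache size should show how much of the switch cost comes from refilling the cache rather than from the switch itself.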
While you said you don't want to write a test application, I did exactly that for an earlier test on an ARM9 Linux platform to find out what the overhead is. It was just two threads that called boost::thread::yield() (or an equivalent) and incremented a variable. After a minute or so (with no other processes running, or at least none doing any real work), the app printed how many context switches it could do per second. Of course this is not really exact, but the point is that both threads yielded the CPU to each other, and it was so fast that it no longer made sense to think about the overhead. So simply go ahead and write a small test instead of thinking too much about a problem that may not exist.
Other than that, you might try performance counters, as 1800 suggested.
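A test along those lines could look roughly like the sketch below. It is not the original ARM9 code: it uses C++11 `std::thread` and `std::this_thread::yield()` in place of Boost, and `yields_per_second` is a name invented for this example. Note that it counts calls to `yield()`, which is only an upper bound on the number of context switches. With more than one core, or when the scheduler has nothing else to run, a yield may return immediately without switching.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Runs two threads that increment a counter and yield in a loop for
// roughly `duration`, then returns the combined number of yield()
// calls per second.
double yields_per_second(std::chrono::milliseconds duration) {
    assert(duration.count() > 0);
    std::atomic<bool> stop{false};
    std::atomic<unsigned long long> count_a{0};
    std::atomic<unsigned long long> count_b{0};

    auto worker = [&stop](std::atomic<unsigned long long>* counter) {
        do {
            counter->fetch_add(1, std::memory_order_relaxed);
            std::this_thread::yield();  // hand the CPU to the other thread
        } while (!stop.load(std::memory_order_relaxed));
    };

    auto start = std::chrono::steady_clock::now();
    std::thread a(worker, &count_a);
    std::thread b(worker, &count_b);
    std::this_thread::sleep_for(duration);
    stop.store(true);
    a.join();
    b.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    return static_cast<double>(count_a.load() + count_b.load()) /
           elapsed.count();
}
```

Calling `yields_per_second(std::chrono::seconds(60))` and printing the result gives a figure comparable to the one described above. For the number to reflect real switches, run it on a single core or pin both threads to the same core.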
Oh, and I remember an application running on Windows CE 4.X where we also had four threads that switched intensively at times, and we never ran into performance issues. We also tried implementing the core threading logic without threads at all and saw no performance improvement (the GUI just responded much more slowly, but everything else was the same). Maybe you can try the same, either by reducing the number of context switches or by removing threads completely (just for testing).