I need to profile a real-time C++ app on Windows. Most of the available profilers are either terribly expensive, total overkill, or both. I don't need any .NET stuff.
Performance Validator (from Software Verification, the company I work for) seems to match what you are looking for:
I occasionally use an application called Very Sleepy: http://www.codersnotes.com/sleepy
It's a simple, unassuming tool, and I don't know how well it suits your needs. It's done well enough for me, as a fairly straightforward sampling profiler. I am authoring a .NET profiler called SlimTune that will gain native support eventually -- but it's not in there now, and it could be some months before it's available.
When I have to profile realtime code, I think the only solution is something hand-rolled. You don't want too much coverage or you end up slowing the code down, but with a small data set, you need to be very focused, essentially picking each point by hand.
So I wrote a header file several years ago that defines some macros and a mechanism for capturing data, either as function timings or as a timeline (at time T in function X). The code uses QueryPerformanceCounter for the timings and writes the data into named shared memory via CreateFileMapping so that I can look at the timing data from another process live.
It takes a recompile to change what timing information I want to capture, but the instrumentation is so cheap that it has virtually no effect on the code being measured.
All of the code is in the header file (with macro guards so that it only gets included once), so the header file itself is my 'profiler'. I change some tables in the header, mark up the target code, recompile, and start profiling.
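To give a rough idea of the shape of such a header, here is a minimal sketch. It is not the actual header described above. The file name, the `prof` namespace, the `PROF_SCOPE` macro, and the probe names are all made up for illustration. It uses `std::chrono::steady_clock` as a portable stand-in for `QueryPerformanceCounter` (on recent MSVC versions `steady_clock` is, as far as I know, built on it), and a plain static table where the real thing would use a shared-memory view from `CreateFileMapping`.

```cpp
// profile_sketch.h: hypothetical header-only timing helper (illustration only).
#ifndef PROFILE_SKETCH_H
#define PROFILE_SKETCH_H

#include <cassert>
#include <chrono>
#include <cstdint>

namespace prof {

// Hand-picked probe points. Editing this table and recompiling changes
// what gets measured. The names are invented for the example.
enum Probe { kDecodeFrame, kMixAudio, kProbeCount };

struct Slot {
    std::uint64_t calls;     // number of completed timed scopes
    std::uint64_t total_ns;  // accumulated elapsed time, in nanoseconds
};

// Fixed-size, zero-initialized table. Assumption: in a setup like the one
// described above, this storage would instead be a view of named shared
// memory so that a second process could read it live.
inline Slot* slots() {
    static Slot table[kProbeCount] = {};
    return table;
}

// Times the enclosing scope and adds the result to one slot.
// Not thread-safe: concurrent probes on the same slot would race.
class ScopedTimer {
    typedef std::chrono::steady_clock Clock;  // stand-in for QueryPerformanceCounter
public:
    explicit ScopedTimer(Probe p) : probe_(p), start_(Clock::now()) {
        assert(p < kProbeCount);
    }
    ~ScopedTimer() {
        const std::chrono::nanoseconds elapsed =
            std::chrono::duration_cast<std::chrono::nanoseconds>(Clock::now() - start_);
        Slot& s = slots()[probe_];
        s.calls += 1;
        s.total_ns += static_cast<std::uint64_t>(elapsed.count());
    }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;
private:
    Probe probe_;
    Clock::time_point start_;
};

}  // namespace prof

// Define PROF_DISABLED to compile every probe out completely.
#ifndef PROF_DISABLED
#define PROF_CONCAT2(a, b) a##b
#define PROF_CONCAT(a, b) PROF_CONCAT2(a, b)
#define PROF_SCOPE(probe) ::prof::ScopedTimer PROF_CONCAT(prof_scope_, __LINE__)(probe)
#else
#define PROF_SCOPE(probe) ((void)0)
#endif

#endif  // PROFILE_SKETCH_H
```

Marking up the target code then amounts to putting `PROF_SCOPE(prof::kDecodeFrame);` at the top of the function or block you care about. The table is fixed-size and each probe costs two clock reads and two additions, which is what keeps the overhead low. The price is that adding a probe means editing the enum and recompiling.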
Since it is a real-time app, I need the profiler to be as fast as possible.
I don't know what you mean by real-time (hard, semi-hard, soft).
I once had to improve the performance of a fax server. The fax protocol is such that if either end delays too long (some tens or hundreds of milliseconds, depending) then the fax session is disconnected. I was therefore unable to use any of the commercial profilers available to me, because they slowed the execution of the server too much. Instead, I added various log messages (with time stamps) to instrument the code and thus find the bottlenecks.
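For what it's worth, the time-stamped logging approach can be sketched roughly like this. This is not the code from the fax server. The `TimestampLog` class and its method names are invented for the example, and it assumes you can afford to buffer the marks in memory and only print them once the time-critical part is over, so that the logging itself does not add I/O delays.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// One time-stamped mark: microseconds since the log was created, plus a label.
struct LogEntry {
    long long us;
    const char* msg;  // expected to point at a string literal
};

// Hypothetical in-memory timeline. Memory is reserved up front so that
// mark() never allocates; once the buffer is full, further marks are dropped.
class TimestampLog {
    typedef std::chrono::steady_clock Clock;
public:
    explicit TimestampLog(std::size_t capacity)
        : capacity_(capacity), start_(Clock::now()) {
        entries_.reserve(capacity);
    }

    // Record "we reached this point" with the current elapsed time.
    void mark(const char* msg) {
        assert(msg != nullptr);
        if (entries_.size() >= capacity_) return;  // full: drop the mark
        const std::chrono::microseconds elapsed =
            std::chrono::duration_cast<std::chrono::microseconds>(Clock::now() - start_);
        LogEntry e = { static_cast<long long>(elapsed.count()), msg };
        entries_.push_back(e);
    }

    // Print each mark with its absolute time and the gap since the previous
    // mark. Call this outside the time-critical path.
    void dump(std::FILE* out) const {
        long long prev = 0;
        for (std::size_t i = 0; i < entries_.size(); ++i) {
            const LogEntry& e = entries_[i];
            std::fprintf(out, "%10lld us (+%lld us) %s\n", e.us, e.us - prev, e.msg);
            prev = e.us;
        }
    }

    const std::vector<LogEntry>& entries() const { return entries_; }

private:
    std::size_t capacity_;
    Clock::time_point start_;
    std::vector<LogEntry> entries_;
};
```

You would call something like `log.mark("frame received")` at each point of interest and `log.dump(stderr)` after the session ends. The large gaps between consecutive marks then point at the slow stretches.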
We do a fair amount of profiling, and have used Shark (OSX only), VTune, GlowCode, and the old favourite of counters/clocks.
Of those Shark is by far and away the best (and free!), to the extent that I try to keep code portable to OSX so I can use it to profile. Unfortunately, it doesn't meet your requirements.
VTune was wholly unimpressive: it was too complicated to get a decent profile out of without being an expert in what all the profiling options do, the front-end GUI frequently crashed or just plain broke, and its sampler doesn't sample the call stack, which makes it almost useless for seeing how the bottlenecks in your program arise. It was also expensive (although we ended up buying a licence). In its favour, it is cross-platform, and you can get a 30-day trial to see if you like it.
GlowCode was decent; IIRC it is Windows-only, and it also offers a free trial. It's been a while since we used it, but it might not be a bad place to start.
We mostly use clocks for our embedded code, which runs single process with little or no system overhead - meaning we can count exactly the number of clock cycles operations take. Personally I wouldn't recommend "rolling your own" profiling code (except at an extremely coarse scale) for two reasons:
I've used AMD CodeAnalyst to great effect, but naturally it has to run on an AMD processor. It's more a case of "tells you more than you want to know" if you dig deep enough. http://developer.amd.com/cpu/codeanalyst/Pages/default.aspx