There's more than one way to do it.
Don't forget the no-profiler method.
Most profilers assume you need 1) high statistical precision of timing (lots of samples), and 2) low precision of problem identification (functions & call-graphs).
Those priorities can be reversed. That is, the problem can be located to a precise machine address, while cost precision is a function of the number of samples.
Most real problems cost at least 10% of the run time, and at that scale high measurement precision is not essential.
Example: If something is making your program take 2 times as long as it should, that means there is some code in it that costs 50%. If you take 10 samples of the call stack while it is being slow, the precise line(s) of code will be present on roughly 5 of them. The larger the program is, the more likely the problem is a function call somewhere mid-stack.
It's counter-intuitive, I know.
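If the arithmetic behind that example seems shaky, here is a minimal sketch of it in Python (my own illustration; it assumes the samples are taken at independent, random points in wall-clock time):

    from math import comb

    def prob_at_least(k, n, p):
        """Probability that a line costing fraction p of the time
        appears on at least k of n independent stack samples (binomial)."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    # A line responsible for 50% of the run time, observed with 10 samples:
    print(prob_at_least(5, 10, 0.5))   # ~0.62 -- it appears on 5 or more samples more often than not
    print(prob_at_least(2, 10, 0.5))   # ~0.99 -- it is almost certain to appear at least twice

Two appearances are already enough to point a finger at the line; extra samples only refine the estimate of how much it costs.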
NOTE: xPerf is nearly there, but not quite (as far as I can tell). It takes samples of the call stack and saves them - that's good. Here's what I think it needs:
- It should only take samples when you want them. As it is, you have to filter out the irrelevant ones.
- In the stack view it should show the specific lines or addresses at which calls take place, not just whole functions. (Maybe it can do this; I couldn't tell from the blog.)
- If you click to get the butterfly view, centered on a single call instruction or leaf instruction, it should show you not the CPU fraction but the fraction of stack samples containing that instruction. That would be a direct measure of the cost of that instruction, as a fraction of time. (Maybe it can do this; I couldn't tell.) So, for example, even if an instruction were a call to file-open or something else that idles the thread, it still costs wall clock time, and you need to know that. (There is a sketch of this computation just after this list.)
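To be concrete about the measure I have in mind, here is a rough sketch (Python; the sample format and the file/line values are invented for illustration, not anything xPerf produces) that reports, for each file:line frame, the fraction of stack samples containing it, which is a direct estimate of its inclusive cost as a fraction of wall-clock time:

    from collections import Counter

    def inclusive_fractions(samples):
        """samples: list of stacks; each stack is a list of (file, line) frames.
        Returns {frame: fraction of samples containing that frame}."""
        counts = Counter()
        for stack in samples:
            for frame in set(stack):    # count each line once per sample, even under recursion
                counts[frame] += 1
        n = len(samples)
        return {frame: c / n for frame, c in counts.items()}

    # Hypothetical samples: the call at foo.c:42 is on 6 of 8 stacks, so removing
    # it would save roughly 75% of wall-clock time. Because these are samples of
    # the whole stack on wall-clock time, a call blocked in file-open counts the
    # same as one burning CPU.
    samples = [
        [("main.c", 10), ("foo.c", 42), ("io.c", 7)],
        [("main.c", 10), ("foo.c", 42), ("io.c", 7)],
        [("main.c", 10), ("foo.c", 42), ("io.c", 9)],
        [("main.c", 10), ("bar.c", 5)],
        [("main.c", 10), ("foo.c", 42), ("io.c", 7)],
        [("main.c", 10), ("foo.c", 42), ("io.c", 7)],
        [("main.c", 10), ("bar.c", 5)],
        [("main.c", 10), ("foo.c", 42), ("io.c", 7)],
    ]
    for frame, frac in sorted(inclusive_fractions(samples).items(), key=lambda x: -x[1]):
        print(frame, f"{frac:.0%}")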
NOTE: I just looked over Luke Stackwalker, and the same remarks apply. I think it is on the right track but needs UI work.
ADDED: Having looked over LukeStackwalker more carefully, I'm afraid it falls victim to the assumption that measuring functions is more important than locating statements. So on each sample of the call stack it updates the function-level timing info, but all it does with the line-number info is track the minimum and maximum line number seen in each function, and the more samples it takes, the farther apart those get. So it basically throws away the most important information: the line-number information. The reason that matters is that if you decide to optimize a function, you need to know which lines in it need work, and those lines were on the stack samples (before they were discarded).
One might object that if the line-number information were retained it would run out of storage quickly. Two answers. 1) There are only so many lines that show up on the samples, and they show up repeatedly. 2) Not so many samples are needed - the need for high statistical precision of measurement has always been assumed, never justified.
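For what it's worth, here is a small sketch (my own illustration, not LukeStackwalker's actual code) contrasting the two aggregation strategies; it shows that keeping the line numbers needs storage proportional only to the number of distinct lines that actually appear on samples:

    from collections import Counter

    def aggregate_minmax(samples):
        """Function-level aggregation (as I read LukeStackwalker): per function,
        only the min and max line seen -- a range, not which lines cost time."""
        span = {}
        for stack in samples:
            for func, line in stack:
                lo, hi = span.get(func, (line, line))
                span[func] = (min(lo, line), max(hi, line))
        return span

    def aggregate_lines(samples):
        """Keep the line numbers: the counter grows only with the number of
        distinct (function, line) frames that ever appear on a sample."""
        return Counter(frame for stack in samples for frame in set(stack))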
I suspect other stack samplers, like xPerf, have similar issues.