Which is the most reliable profiling tool gprof or kcachegrind?

前端 未结 2 842
悲&欢浪女
悲&欢浪女 2021-02-10 12:25

Profiling some C++ number crunching code with both gprof and kcachegrind gives similar results for the functions that contribute most to the execution

相关标签:
2条回答
  • 2021-02-10 12:56

    gprof's timing data is statistical (read about it in details of profiling docs).

    On the other hand, KCacheGrind uses valgrind which actually interprets all the code.

    So KCacheGrind can be "more accurate" (at the expense of more overhead) if the CPU modeled by valgrind is close to your real CPU.

    Which one to choose also depends on what type of overhead you can handle. In my experience, gprof adds less runtime overhead (execution time that is), but it is more intrusive (i.e. -pg adds code to each and every one of your functions). So depending on the situation, on or the other is more appropriate.

    For "better" gprof data, run your code longer (and on as wide a range of test data you can). The more you have, the better the measurements will be statistically.

    0 讨论(0)
  • 2021-02-10 13:13

    gprof is actually quite primitive. Here's what it does. 1) It samples the program counter at a constant rate and records how many samples land in each function (exclusive time). 2) It counts how many times any function A calls any function B. From that it can find out how many times each function was called in total, and what it's average exclusive time was. To get average inclusive time of each function it propagates exclusive time upward in the call graph.

    If you're expecting this to have some kind of accuracy, you should be aware of some issues. First, it only counts CPU-time-in-process, meaning it is blind to I/O or other system calls. Second, recursion confuses it. Third, the premise that functions always adhere to an average run time, no matter when they are called or who calls them, is very suspect. Forth, the notion that functions (and their call graph) are what you need to know about, rather than lines of code, is simply a popular assumption, nothing more. Fifth, the notion that accuracy of measurement is even relevant to finding "bottlenecks" is also just a popular assumption, nothing more.

    Callgrind can work at the level of lines - that's good. Unfortunately it shares the other problems.

    If your goal is to find "bottlenecks" (as opposed to getting general measurements), you should take a look at wall-clock time stack samplers that report percent-by-line, such as Zoom. The reason is simple but possibly unfamiliar.

    Suppose you have a program with a bunch of functions calling each other that takes a total of 10 seconds. Also, there is a sampler that samples, not just the program counter, but the entire call stack, and it does it all the time at a constant rate, like 100 times per second. (Ignore other processes for now.)

    So at the end you have 1000 samples of the call stack. Pick any line of code L that appears on more than one of them. Suppose you could somehow optimize that line, by avoiding it, removing it, or passing it off to a really really fast processor.

    What would happen to those samples?

    Since that line of code L now takes (essentially) no time at all, no sample can hit it, so those samples would just disappear, reducing the total number of samples, and therefore the total time! In fact the overall time would be reduced by the fraction of time L had been on the stack, which is roughly the fraction of samples that contained it.

    I don't want to get too statistical, but many people think you need a lot of samples, because they think accuracy of measurement is important. It isn't, if the reason you're doing this is to find out what to fix to get speedup. The emphasis is on finding what to fix, not on measuring it. Line L is on the stack some fraction F of the time, right? So each sample has a probability F of hitting it, right? Just like flipping a coin. There is a theory of this, called the Rule of Succession. It says that (under simplifying but general assumptions), if you flip a coin N times, and see "heads" S times, you can estimate the fairness of the coin F as (on average) (S+1)/(N+2). So, if you take as few as three samples, and see L on two of them, do you know what F is? Of course not. But you do know on average it is (2+1)/(3+2) or 60%. So that's how much time you could save (on average) by "optimizing away" line L. And, of course, the stack samples showed you exactly where line L (the "bottleneck"**) is. Did it really matter that you didn't measure it to two or three decimal places?

    BTW, it is immune to all the other problems mentioned above.

    **I keep putting quotes around "bottleneck" because what makes most software slow has nothing in common with the neck of a bottle. A better metaphor is a "drain" - something that just needlessly wastes time.

    0 讨论(0)
提交回复
热议问题