Question
I want to profile my C++ program with Linux perf. To do so I ran the following three commands, and I do not understand why I get three completely different reports.
perf record --call-graph dwarf ./myProg
perf report
perf record --call-graph fp ./myProg
perf report
perf record --call-graph lbr ./myProg
perf report
Also, I do not understand why the main function is not the highest function in the list.
The logic of my program is as follows: the main function calls the getPogDocumentFromFile function, which calls fromPoxml, which calls toPred, which calls applySubst, which calls subst. Moreover, toPred, applySubst and subst are recursive functions, and I expect them to be the bottleneck.
Some more comments: my program runs for about 25 minutes, is highly recursive, and allocates a lot of memory (~17 GB). I also compile with -fno-omit-frame-pointer and use a recent Intel CPU.
Any idea?
EDIT:
Thinking again about my question, I realize that I do not understand the meaning of the Children column.
So far I assumed that the Self column was the percentage of samples with the function in question at the top of the call stack, and that the Children column was the percentage of samples with the function anywhere in the call stack. Obviously this is not the case, otherwise the main function would have its children column not far from 100%. Maybe the call stack is truncated? Or am I completely misunderstanding how profilers work?
Answer 1:
The man page of perf report documents the call chain display with children accumulation:
--children Accumulate callchain of children to parent entry so that then can show up in the output. The output will have a new "Children" column and will be sorted on the data. It requires callchains are recorded. See the ‘overhead calculation’ section for more details. Enabled by default, disable with --no-children.
I recommend trying the non-default mode with the --no-children option of perf report (or perf top -g --no-children -p $PID_OF_PROGRAM).
So in the default mode, when the perf.data file contains callchain data, perf report calculates both "self" and "self+children" overhead and sorts on the accumulated value. This means that if some function f1() has 10% of "self" samples and calls a leaf function f2() with 20% of "self" samples, then the self+children value of f1() will be 30%. The accumulated value covers all stacks in which the function appears: the work done in the function itself plus the work done in all of its direct and indirect children (descendants).
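The accumulation described above can be sketched in a few lines of C++. This is only an illustration of the idea, not perf's actual implementation: compute_overhead, example_samples and the sample data are hypothetical, and counting a function once per sample (so that recursive frames do not push it above 100%) is an assumption about how such a column is best computed.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// One sampled call stack, outermost frame first (e.g. main, f1, f2).
using Stack = std::vector<std::string>;

struct Overhead {
    double self = 0.0;      // percent of samples where the function is the leaf frame
    double children = 0.0;  // percent of samples where it appears anywhere in the stack
};

// Hypothetical helper (not part of perf): derives both columns from raw stacks.
std::map<std::string, Overhead> compute_overhead(const std::vector<Stack>& samples) {
    assert(!samples.empty());  // precondition: avoids dividing by zero below
    std::map<std::string, Overhead> result;
    const double weight = 100.0 / static_cast<double>(samples.size());
    for (const Stack& stack : samples) {
        if (stack.empty()) continue;
        // "Self": only the innermost (leaf) frame gets the sample.
        result[stack.back()].self += weight;
        // "Children": every function in the stack gets the sample,
        // counted once even if recursion repeats it.
        const std::set<std::string> unique(stack.begin(), stack.end());
        for (const std::string& fn : unique) result[fn].children += weight;
    }
    return result;
}

// Ten made-up samples matching the f1()/f2() example in the text.
std::vector<Stack> example_samples() {
    std::vector<Stack> samples(7, Stack{"main"});
    samples.push_back(Stack{"main", "f1"});
    samples.push_back(Stack{"main", "f1", "f2"});
    samples.push_back(Stack{"main", "f1", "f2"});
    return samples;
}
```

With these samples f1 gets 10% self and 30% children, f2 gets 20% for both, and main reaches 100% children. If some of the stacks were cut off before main, main's share would drop accordingly, which is the effect described in the question.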
You can select the call stack sampling method with the --call-graph option (dwarf / lbr / fp), and each method has limitations. Sometimes a method (especially fp) fails to extract parts of the call stack. The -fno-omit-frame-pointer option may help, but when it is used for your executable and not for some library that calls back into your code, the call stack will only be partially extracted. Very long call chains may also be cut off by some methods: dwarf copies a fixed-size chunk of the user stack per sample (8192 bytes by default, adjustable as --call-graph dwarf,<size>), and lbr is limited by the small number of hardware LBR entries. Or perf report may fail to handle some cases.
To check for truncated call chains, run perf script | less and look at samples somewhere in the middle of the output. In this mode perf prints every recorded sample with all detected function names; samples whose call chain does not end with main and __libc_start_main are truncated.
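For a 25-minute run there are too many samples to inspect by eye, so the same check can be automated. The following is a minimal sketch, assuming the usual perf script text layout: an unindented header line per sample, followed by indented frame lines of the form "address symbol+0xoffset (dso)" and a blank line. The names ChainStats, frame_symbol, count_truncated and count_truncated_in_text are hypothetical, and the parsing may need adjusting if your perf version prints a different layout.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

struct ChainStats {
    int samples = 0;    // samples that carried at least one callchain frame
    int truncated = 0;  // of those, samples whose chain never reaches main
};

// Extracts the symbol from a frame line assumed to look like
//   "\t    55d0c1a2b3c4 subst+0x1a (/home/user/myProg)"
std::string frame_symbol(const std::string& line) {
    const std::size_t start = line.find_first_not_of(" \t");
    assert(start != std::string::npos);  // callers pass non-blank lines only
    const std::size_t addr_end = line.find(' ', start);
    if (addr_end == std::string::npos) return "";
    // C++ symbols may contain spaces, so cut at the last " (" (start of the dso).
    std::size_t sym_end = line.rfind(" (");
    if (sym_end == std::string::npos) sym_end = line.size();
    if (sym_end <= addr_end) return "";
    std::string sym = line.substr(addr_end + 1, sym_end - addr_end - 1);
    const std::size_t off = sym.rfind("+0x");  // drop "+0x1a" if present
    if (off != std::string::npos) sym.erase(off);
    return sym;
}

// Reads perf-script-like text and counts samples whose callchain lacks main.
ChainStats count_truncated(std::istream& in) {
    ChainStats stats;
    bool has_frames = false;
    bool reaches_main = false;
    auto finish_sample = [&] {
        if (has_frames) {
            ++stats.samples;
            if (!reaches_main) ++stats.truncated;
        }
        has_frames = false;
        reaches_main = false;
    };
    std::string line;
    while (std::getline(in, line)) {
        if (line.find_first_not_of(" \t\r") == std::string::npos) {
            finish_sample();                       // blank line ends a sample
        } else if (line[0] == ' ' || line[0] == '\t') {
            has_frames = true;                     // indented line: one frame
            if (frame_symbol(line) == "main") reaches_main = true;
        } else {
            finish_sample();                       // header line starts a new sample
        }
    }
    finish_sample();
    return stats;
}

// Convenience wrapper for text already held in memory.
ChainStats count_truncated_in_text(const std::string& text) {
    std::istringstream in(text);
    return count_truncated(in);
}
```

A small main that calls count_truncated(std::cin) and prints both counters could then be fed the output of perf script through a pipe. A high truncated share for one --call-graph mode would suggest that its reports are missing the outer frames, which would explain a low Children value for main.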
"otherwise the main function would have its children column not far from 100%"
Yes, for a single-threaded program with correctly recorded and processed call stacks, main should have something like 99% in the "Children" column. In multithreaded programs, the second and subsequent threads have a different root node, such as start_thread.
Source: https://stackoverflow.com/questions/59307540/profiling-my-program-with-linux-perf-and-different-call-graph-modes-gives-differ