I would like to know how to write a profiler. What books and/or articles do you recommend? Can anyone help me, please?
Has anyone already done something like this?
As another answer, I just looked at LukeStackwalker on SourceForge. It is a nice, small example of a stack sampler, and a nice place to start if you want to write a profiler.
Here, in my opinion, is what it does right:
Sigh ... so near yet so far. Here, IMO, is what it (and other stack samplers like xPerf) should do:
It should retain the raw stack samples. As it is, it summarizes at the function level as it samples. This loses the key line-number information locating the problematic call sites.
It need not take so many samples, if storage to hold them is an issue. Since typical performance problems cost from 10% to 90%, 20-40 samples will show them quite reliably. Hundreds of samples give more measurement precision, but they do not increase the probability of locating the problems.
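The claim that 20-40 samples reliably find a 10%-90% problem is just binomial arithmetic, and you can sanity-check it yourself. Here is a small sketch (the function name `prob_at_least` is my own, not from any profiler) that computes the probability that a line on the stack a fraction p of the time shows up in at least k of n independent samples:

```cpp
#include <cmath>

// Probability that a line on the stack a fraction p of the time
// appears in at least k of n independent stack samples.
// Computed as 1 minus the binomial lower tail P(X < k).
double prob_at_least(int n, int k, double p) {
    double tail = 0.0;
    for (int i = 0; i < k; ++i) {
        // C(n,i) * p^i * (1-p)^(n-i), via log-gamma for stability
        tail += std::exp(std::lgamma(n + 1) - std::lgamma(i + 1) - std::lgamma(n - i + 1)
                         + i * std::log(p) + (n - i) * std::log(1.0 - p));
    }
    return 1.0 - tail;
}
```

For example, a line costing 30% of the time appears in at least one of 20 samples with probability 1 - 0.7^20, which is better than 99.9%; even a 10% line shows up about 88% of the time.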
The UI should summarize in terms of statements, not functions. This is easy to do if the raw samples are kept. The key measure to attach to a statement is the fraction of samples containing it. For example:
5/20 MyFile.cpp:326 for (i = 0; i < strlen(s); ++i)
This says that line 326 in MyFile.cpp showed up on 5 out of 20 samples, in the process of calling strlen. This is very significant, because you can instantly see the problem, and you know how much speedup you can expect from fixing it. If you replace strlen(s) by s[i], it will no longer be spending time in that call, so these samples will not occur, and the speedup will be approximately 1/(1 - 5/20) = 20/(20 - 5) = 4/3, i.e. about a 33% speedup. (Thanks to David Thornley for this sample code.)
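To make that concrete, here is a sketch of the before-and-after loops and the speedup arithmetic (the function names are hypothetical stand-ins, not from the sample code above):

```cpp
#include <cstddef>
#include <cstring>

// Original form: strlen(s) is re-evaluated on every iteration, so the
// call keeps landing on the stack (and the loop is O(n^2) overall).
int count_original(const char* s) {
    int n = 0;
    for (std::size_t i = 0; i < strlen(s); ++i) ++n;
    return n;
}

// Fixed form: test s[i] directly, so no call appears on the stack at all.
int count_fixed(const char* s) {
    int n = 0;
    for (std::size_t i = 0; s[i] != '\0'; ++i) ++n;
    return n;
}

// If a line is on `hits` out of `samples` stack samples and you make it
// cost nothing, the expected speedup is samples / (samples - hits).
double expected_speedup(int hits, int samples) {
    return static_cast<double>(samples) / (samples - hits);
}
```

With 5 hits out of 20 samples, `expected_speedup(5, 20)` gives 4/3, matching the arithmetic above.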
The UI should have a "butterfly" view showing statements. (If it shows functions too, that's OK, but the statements are what really matter.) For example:
3/20 MyFile.cpp:502 MyFunction(myArgs)
2/20 HisFile.cpp:113 MyFunction(hisArgs)
5/20 MyFile.cpp:326 for (i = 0; i < strlen(s); ++i)
5/20 strlen.asm:23 ... some assembly code ...
In this example, the line containing the for statement is the "focus of attention". It occurred on 5 samples. The two lines above it say that on 3 of those samples, it was called from MyFile.cpp:502, and on 2 of those samples, it was called from HisFile.cpp:113. The line below it says that on all 5 of those samples, it was in strlen (no surprise there). In general, the focus line will have a tree of "parents" and a tree of "children". If for some reason the focus line is not something you can fix, you can go up or down. The goal is to find lines that you can fix that are on as many samples as possible.
IMPORTANT: Profiling should not be looked at as something you do once. For example, in the sample above, we got a 4/3 speedup by fixing one line of code. When the process is repeated, other problematic lines of code should show up at 4/3 the frequency they did before, and thus be easier to find. I rarely hear people talk about iterating the profiling process, but it is crucial to getting large compounded speedups overall.
P.S. If a statement occurs more than once in a single sample, that means there is recursion taking place. It is not a problem. It still only counts as one sample containing the statement. It is still the case that the cost of the statement is approximated by the fraction of samples containing it.
Encouraging lot, aren't we :)
Profilers aren't too hard if you're just trying to get a reasonable idea of where the program's spending most of its time. If you're bothered about high accuracy and minimum disruption, things get difficult.
So if you just want the answers a profiler would give you, go for one someone else has written. If you're looking for the intellectual challenge, why not have a go at writing one?
I've written a couple, for run time environments that the years have rendered irrelevant.
There are two approaches:
adding something to each function or other significant point that logs the time and where it is;
having a timer go off regularly and taking a peek at where the program currently is.
The JVMPI version seems to be of the first kind; the link provided by uzhin shows that it can report on quite a number of things (see section 1.3). Instrumentation changes what gets executed, so the profiling itself can affect performance (and if you're profiling what is otherwise a very lightweight but often-called function, it can mislead).
If you can get a timer/interrupt telling you where the program counter was at the time of the interrupt, you can use the symbol table/debugging information to work out which function it was in at the time. This provides less information but can be less disruptive. A bit more information can be obtained by walking the call stack to identify callers, etc. I have no idea whether this is even possible in Java...
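On Linux/glibc, the raw material for the second approach is readily available. The sketch below captures one raw stack sample with glibc's backtrace(); a real sampling profiler would call this from a SIGPROF handler installed with sigaction() and driven by setitimer(ITIMER_PROF, ...), but calling it directly is enough to show what one sample contains. (This is glibc-specific, and the function name `take_sample` is my own.)

```cpp
#include <execinfo.h>  // glibc-specific: backtrace(), backtrace_symbols()

// One raw sample is just the chain of return addresses on the stack.
// Returns the number of frames captured into `frames`.
int take_sample(void** frames, int max_frames) {
    return backtrace(frames, max_frames);
}
```

The addresses can then be mapped to function names and line numbers with backtrace_symbols() or the debugging information, which is exactly the symbol-table step described above.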
Paul.
I would look at these open-source projects first:
Then I would look at JVMTI (not JVMPI)
I wrote one once, mainly as an attempt to make "deep sampling" more user-friendly. The manual version of the method is explained here. It is based on sampling, but rather than taking a large number of small samples, you take a small number of large samples.
It can tell you, for example, that instruction I (usually a function call) is costing you some percentage X of total execution time, more or less, since it appears on the stack in X% of samples.
Think about it, because this is a key point. The call stack exists as long as the program is running. If a particular call instruction I is on the stack X% of the time, then if that instruction could disappear, that X% of time would disappear. This does not depend on how many times I is executed, or on how long the function call takes. So timers and counters are missing the point. And in a sense all instructions are call instructions, even if they only call microcode.
The sampler is based on the premise that it is better to know the address of instruction I with precision (because that is what you are looking for) than to know the number X% with precision. If you know that you could save roughly 30% of time by recoding something, do you really care that you might be off by 5%? You're still going to want to fix it. The amount of time it actually saves is made neither less nor greater by your knowing X precisely.
So it is possible to drive samples off a timer, but frankly I found it just as useful to trigger an interrupt by the user pressing both shift keys at the same time. Since 20 samples is generally plenty, and this way you can be sure to take samples at a relevant time (i.e. not while waiting for user input), it was quite adequate. Another way would be to do timer-driven samples only while the user holds down both shift keys (or something like that).
It did not concern me that the taking of samples might slow down the program, because the goal was not to measure speed, but to locate the most costly instructions. After fixing something, the overall speedup is easy to measure.
The main thing that the profiler provided was a UI so you could examine the results painlessly. What comes out of the sampling phase is a collection of call-stack samples, where each sample is a list of addresses of instructions, and every instruction but the last is a call instruction.
The UI was mainly what is called a "butterfly view". It has a current "focus", which is a particular instruction. To the left are displayed the call instructions immediately above that instruction, as culled from the stack samples. If the focus instruction is a call instruction, then the instructions below it appear to the right, as culled from the samples. On the focus instruction is displayed a percentage, which is the percent of stacks containing that instruction. Similarly, for each instruction on the left or right, the percentage is broken down by the frequency of each such instruction. Of course, each instruction was represented by file, line number, and the name of the function it was in. The user could easily explore the data by clicking any of the instructions to make it the new focus.
A variation on this UI treated the butterfly as bipartite, consisting of alternating layers of function call instructions and the functions containing them. That can give a little more clarity of time spent in each function.
Maybe it's not obvious, so it's worth mentioning some properties of this technique.
Recursion is not an issue, because if an instruction appears more than once on any given stack sample, that still counts as only one sample containing it. It still remains true that the estimated time that would be saved by its removal is the percent of stacks it is on.
Notice this is not the same as a call tree. It gives you the cost of an instruction no matter how many different branches of a call tree it is in.
Performance of the UI is not an issue, because the number of samples need not be very large. If a particular instruction I is the focus, it is quite simple to find how many samples contain it, and for each adjacent instruction, how many of the samples containing I also contain that adjacent instruction next to it.
As mentioned before, speed of sampling is not an issue, because we're not measuring performance, we're diagnosing. The sampling does not bias the results, because the sampling does not affect what the overall program does. An algorithm that takes N instructions to complete still takes N instructions even if it is halted any number of times.
I'm often asked how to sample a program that completes in milliseconds. The simple answer is to wrap it in an outer loop to make it take long enough to sample. You can find out what takes X% of the time, remove it, get the X% speedup, and then remove the outer loop.
This little profiler, that I called YAPA (yet another performance analyzer) was DOS-based and made a nice little demo, but when I had serious work to do, I would fall back on the manual method. The main reason for this is that the call stack alone is often not enough state information to tell you why a particular cycle is being spent. You may also need to know other state information so you have a more complete idea of what the program was doing at that time. Since I found the manual method pretty satisfactory, I shelved the tool.
A point that's often missed when talking about profiling is that you can do it repeatedly to find multiple problems. For example, suppose instruction I1 is on the stack 5% of the time, and I2 is on the stack 50% of the time. Twenty samples will easily find I2, but maybe not I1. So you fix I2. Then you do it all again, but now I1 takes 10% of the time, so 20 samples will probably see it. This magnification effect allows repeated applications of profiling to achieve large compounded speedup factors.
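The magnification arithmetic is simple enough to write down. After a fix removes a line that cost fraction f2 of the total time, everything that remains runs in (1 - f2) of the old time, so a surviving line that cost fraction f1 now accounts for f1 / (1 - f2) of the new total. A one-line sketch (the function name is mine):

```cpp
// Fraction of the post-fix run attributable to a line that cost f1 of the
// original run, after another line costing f2 of the original is removed.
double magnified_fraction(double f1, double f2) {
    return f1 / (1.0 - f2);
}
```

With f1 = 0.05 and f2 = 0.50, this gives 0.10, matching the 5% line growing to 10% in the example above.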
JVMPI spec: http://java.sun.com/j2se/1.5.0/docs/guide/jvmpi/jvmpi.html
I salute your courage and bravery
EDIT: And as noted by user Boune, JVMTI: http://java.sun.com/developer/technicalArticles/Programming/jvmti/