One could use a profiler, but why not just halt the program? [closed]

前端未结

关注

 17  2428

悲&欢浪女

相关标签:

17条回答

渐次进展

2020-11-22 00:52

There's a difference between things that programmers actually do, and things that they recommend others do.

I know of lots of programmers (myself included) that actually use this method. It only really helps to find the most obvious of performance problems, but it's quick and dirty and it works.

But I wouldn't really tell other programmers to do it, because it would take me too long to explain all the caveats. It's far too easy to make an inaccurate conclusion based on this method, and there are many areas where it just doesn't work at all. (for example, that method doesn't reveal any code that is triggered by user input).

So just like using lie detectors in court, or the "goto" statement, we just don't recommend that you do it, even though they all have their uses.

0 讨论(0)
发布评论:

提交评论
- 加载中...
一整个雨季

2020-11-22 00:53

What if the program is in production and being used at the same time by paying clients or colleagues. A profiler allows you to observe without interferring (as much, because of course it will have a little hit too as per the Heisenberg principle).

Profiling can also give you much richer and more detailed accurate reports. This will be quicker in the long run.

0 讨论(0)
发布评论:

提交评论
- 加载中...
春和景丽

2020-11-22 00:56

Callstack sampling is a very useful technique for profiling, especially when looking at a large, complicated codebase that could be spending its time in any number of places. It has the advantage of measuring the CPU's usage by wall-clock time, which is what matters for interactivity, and getting callstacks with each sample lets you see why a function is being called. I use it a lot, but I use automated tools for it, such as Luke Stackwalker and OProfile and various hardware-vendor-supplied things.

The reason I prefer automated tools over manual sampling for the work I do is statistical power. Grabbing ten samples by hand is fine when you've got one function taking up 40% of runtime, because on average you'll get four samples in it, and always at least one. But you need more samples when you have a flat profile, with hundreds of leaf functions, none taking more than 1.5% of the runtime.

Say you have a lake with many different kinds of fish. If 40% of the fish in the lake are salmon (and 60% "everything else"), then you only need to catch ten fish to know there's a lot of salmon in the lake. But if you have hundreds of different species of fish, and each species is individually no more than 1%, you'll need to catch a lot more than ten fish to be able to say "this lake is 0.8% salmon and 0.6% trout."

Similarly in the games I work on, there are several major systems each of which call dozens of functions in hundreds of different entities, and all of this happens 60 times a second. Some of those functions' time funnels into common operations (like malloc), but most of it doesn't, and in any case there's no single leaf that occupies more than 1000 μs per frame.

I can look at the trunk functions and see, "we're spending 10% of our time on collision", but that's not very helpful: I need to know exactly where in collision, so I know which functions to squeeze. Just "do less collision" only gets you so far, especially when it means throwing out features. I'd rather know "we're spending an average 600 μs/frame on cache misses in the narrow phase of the octree because the magic missile moves so fast and touches lots of cells," because then I can track down the exact fix: either a better tree, or slower missiles.

Manual sampling would be fine if there were a big 20% lump in, say, stricmp, but with our profiles that's not the case. Instead I have hundreds of functions that I need to get from, say, 0.6% of frame to 0.4% of frame. I need to shave 10 μs off every 50 μs function that is called 300 times per second. To get that kind of precision, I need more samples.

But at heart what Luke Stackwalker does is what you describe: every millisecond or so, it halts the program and records the callstack (including the precise instruction and line number of the IP). Some programs just need tens of thousands of samples to be usefully profiled.

(We've talked about this before, of course, but I figured this was a good place to summarize the debate.)

0 讨论(0)
发布评论:

提交评论
- 加载中...
太阳男子

2020-11-22 00:57

On Java servers it's always been a neat trick to do 2-3 quick Ctrl-Breakss in a row and get 2-3 threaddumps of all running threads. Simply looking at where all the threads "are" may extremely quickly pinpoint where your performance problems are.

This technique can reveal more performance problems in 2 minutes than any other technique I know of.

0 讨论(0)
发布评论:

提交评论
- 加载中...
自闭症患者

2020-11-22 01:00
EDIT 2008/11/25: OK, Vineet's response has finally made me see what the issue is here. Better late than never.

Somehow the idea got loose in the land that performance problems are found by measuring performance. That is confusing means with ends. Somehow I avoided this by single-stepping entire programs long ago. I did not berate myself for slowing it down to human speed. I was trying to see if it was doing wrong or unnecessary things. That's how to make software fast - find and remove unnecessary operations.

Nobody has the patience for single-stepping these days, but the next best thing is to pick a number of cycles at random and ask what their reasons are. (That's what the call stack can often tell you.) If a good percentage of them don't have good reasons, you can do something about it.

It's harder these days, what with threading and asynchrony, but that's how I tune software - by finding unnecessary cycles. Not by seeing how fast it is - I do that at the end.

Here's why sampling the call stack cannot give a wrong answer, and why not many samples are needed.

During the interval of interest, when the program is taking more time than you would like, the call stack exists continuously, even when you're not sampling it.
- If an instruction I is on the call stack for fraction P(I) of that time, removing it from the program, if you could, would save exactly that much. If this isn't obvious, give it a bit of thought.
If the instruction shows up on M = 2 or more samples, out of N, its P(I) is approximately M/N, and is definitely significant.

The only way you can fail to see the instruction is to magically time all your samples for when the instruction is not on the call stack. The simple fact that it is present for a fraction of the time is what exposes it to your probes.

So the process of performance tuning is a simple matter of picking off instructions (mostly function call instructions) that raise their heads by turning up on multiple samples of the call stack. Those are the tall trees in the forest.

Notice that we don't have to care about the call graph, or how long functions take, or how many times they are called, or recursion.

I'm against obfuscation, not against profilers. They give you lots of statistics, but most don't give P(I), and most users don't realize that that's what matters.

You can talk about forests and trees, but for any performance problem that you can fix by modifying code, you need to modify instructions, specifically instructions with high P(I). So you need to know where those are, preferably without playing Sherlock Holmes. Stack sampling tells you exactly where they are.

This technique is harder to employ in multi-thread, event-driven, or systems in production. That's where profilers, if they would report P(I), could really help.
0 讨论(0)
发布评论:

提交评论
- 加载中...

上一页 1 2 3

热议问题