How to accurately measure clock cycles used by a C++ function?


Question


I know that I have to use rdtsc. The function I am measuring is deterministic, but the result is far from repeatable (I see about 5% variation from run to run). Possible causes are:

  • context switching
  • cache misses

Do you know of any other causes? How can I eliminate them?
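
For reference, a minimal sketch of the kind of measurement in question, using the __rdtsc intrinsic with lfence-based serialization. The serialization choice, the header, and the placeholder function are illustrative assumptions, not something stated in the question:

    // x86-64, GCC/Clang assumed; MSVC would use <intrin.h> instead.
    #include <cstdint>
    #include <cstdio>
    #include <x86intrin.h>   // __rdtsc, _mm_lfence

    // Placeholder for the deterministic function being measured.
    static void function_under_test() {
        volatile int sink = 0;
        for (int i = 0; i < 1000; ++i) sink = sink + i;
    }

    int main() {
        _mm_lfence();                        // keep earlier work from drifting past the read
        std::uint64_t start = __rdtsc();
        _mm_lfence();

        function_under_test();

        _mm_lfence();
        std::uint64_t end = __rdtsc();
        _mm_lfence();

        std::printf("cycles: %llu\n",
                    static_cast<unsigned long long>(end - start));
    }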


Answer 1:


TSCs (what rdtsc uses) are often not synchronized on multi-processor systems. It may help to set the CPU affinity in order to bind the process to a single CPU.

You could also get timestamps from HPET timers if available, which aren't prone to the same problem.

As for repeatability, that variance is real. You could disable caching, give the process real-time priority, and/or (if on Linux or something similar) recompile your kernel with a lower, fixed timer interrupt frequency (the one that does time-slicing). You can't eliminate the variance completely, at least not easily and not on regular CPU + OS combos.

In general, for ease of coding, reliability, and portability, I suggest you use what the OS has to offer. If it offers high-precision timers, use the appropriate OS helper.
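
As one illustration of "use the appropriate OS helper" from C++, the standard library's steady_clock wraps whatever high-resolution monotonic timer the platform provides. This is only a sketch with a placeholder workload, not something the answer prescribes:

    #include <chrono>
    #include <cstdio>

    // Placeholder for the function being measured.
    static void function_under_test() {
        volatile int sink = 0;
        for (int i = 0; i < 1000; ++i) sink = sink + i;
    }

    int main() {
        auto t0 = std::chrono::steady_clock::now();
        function_under_test();
        auto t1 = std::chrono::steady_clock::now();

        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        std::printf("elapsed: %lld ns\n", static_cast<long long>(ns));
    }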

(Just in case you're trying a timing attack on a crypto system: you're going to have to live with (1) this randomness and (2) general defenses that make the system unpredictable for good reasons, so the function might not be deterministic with respect to time.)

EDIT: added paragraph about timers the OS can offer.

EDIT: This refers to Linux. To bind a process to a single CPU (so you get a consistent reading from RDTSC), you can use sched_setaffinity(2). There is also some code from one of my projects that uses it for another purpose (mapping threads to CPUs). That should be your first try. As for HPETs, you can use regular POSIX timer calls, as long as the kernel and the machine support these timers.
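
A minimal sketch of the pinning idea, using sched_setaffinity(2) plus a POSIX clock read. It is Linux-only; the choice of CPU 1 and of CLOCK_MONOTONIC_RAW are my assumptions, and error handling is kept to a minimum:

    #include <sched.h>    // sched_setaffinity, CPU_ZERO, CPU_SET
    #include <time.h>     // clock_gettime, CLOCK_MONOTONIC_RAW
    #include <cstdio>

    int main() {
        // Pin this process to one CPU so RDTSC always reads the same core's TSC.
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(1, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            std::perror("sched_setaffinity");
            return 1;
        }

        // A plain POSIX clock read; on suitable hardware/kernels this is backed
        // by a high-resolution source (TSC or HPET).
        timespec ts{};
        clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
        std::printf("monotonic raw: %lld.%09ld s\n",
                    static_cast<long long>(ts.tv_sec), ts.tv_nsec);
    }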




Answer 2:


Why eliminate them? It sounds like you've created a realistic benchmark. That code would show the same kind of variability when used in the wild, too; probably worse, since in your benchmark you've likely eliminated disk and CPU cache latencies. Using Jon Skeet's approach of creating conditions that give you the best possible result will only leave you with a number that makes you feel good but is never achievable.

If an absolute number is important, calculate the median, not the average.
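
A short sketch of the median idea: collect many samples and take the middle one, so a few outliers (context switches, page faults) don't drag the figure around. The placeholder measurement below is assumed, not part of the answer:

    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Placeholder for a single timed run; substitute the rdtsc-based timing
    // of the real function here.
    static std::uint64_t time_one_run() {
        auto t0 = std::chrono::steady_clock::now();
        volatile int sink = 0;
        for (int i = 0; i < 1000; ++i) sink = sink + i;
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    }

    int main() {
        std::vector<std::uint64_t> samples(101);
        for (auto& s : samples) s = time_one_run();

        // nth_element partially sorts so the middle element becomes the median.
        std::nth_element(samples.begin(),
                         samples.begin() + samples.size() / 2,
                         samples.end());
        std::printf("median: %llu ns\n",
                    static_cast<unsigned long long>(samples[samples.size() / 2]));
    }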




Answer 3:


See the question Is stopwatch benchmarking acceptable? for a discourse on the variance of micro-benchmarking on a modern multi-core multi-threading multi-process machine.

Although the question is about Java, the considerations apply to benchmarking in any language.

Also see: How do I write a correct micro-benchmark in Java?

Also see: What advice can you give me for writing a meaningful benchmark?




Answer 4:


Actually, there is the new perf subsystem in newer Linux kernels. Example:

$ ./perf stat du -s /tmp
94796   /tmp

 Performance counter stats for 'du -s /tmp':

          2.546403 task-clock-msecs         #      0.060 CPUs 
                 3 context-switches         #      0.001 M/sec
                 0 CPU-migrations           #      0.000 M/sec
               166 page-faults              #      0.065 M/sec
           2434963 cycles                   #    956.236 M/sec
           1798092 instructions             #      0.738 IPC  
            302969 branches                 #    118.979 M/sec
             26197 branch-misses            #      8.647 %    
             23217 cache-references         #      9.118 M/sec
              4621 cache-misses             #      1.815 M/sec

        0.042406580  seconds time elapsed



Answer 5:


Adding to the list of causes: branch prediction/misprediction (on some chips with complex prediction caches, this can be affected by a context switch). Prediction can also be affected by different inputs to your program, so a direct comparison between the times for two different data sets can be slightly skewed.

In general, it is nigh impossible to mitigate all of these, but there are some things you can do to help each one:

  • Cache miss: "Prime" the cache before you start timing. Don't forget there is an instruction cache which also needs to be primed. For small data sets, just run the entire test once without timing, then run it again with timing. For large data sets, do this, but then use the processor's prefetch instruction to load the first data block back into the cache (see the sketch after this list).
  • Context switch: Use a multi-processor/core chip on a lightly loaded system and set the process's affinity to a specific CPU (preferably not CPU 0). This will also help with cache misses (since moving CPUs means the cache is completely lost) and branch prediction (since really it is a form of cache).
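
Here is a rough sketch of the priming step from the first bullet: one untimed warm-up pass to pull code and data into the caches, plus an explicit prefetch of the first data block for larger sets. The block size and the _MM_HINT_T0 hint are assumptions for illustration:

    #include <cstddef>
    #include <cstdint>
    #include <vector>
    #include <xmmintrin.h>   // _mm_prefetch

    // Placeholder workload over the data set being measured.
    static std::uint64_t process(const std::vector<int>& data) {
        std::uint64_t sum = 0;
        for (int v : data) sum += v;
        return sum;
    }

    int main() {
        std::vector<int> data(1 << 20, 1);

        // Untimed warm-up pass: primes both the instruction and data caches.
        volatile std::uint64_t warmup = process(data);
        (void)warmup;

        // For a large data set, prefetch the first block back into the cache
        // before the timed run (16 ints per 64-byte line assumed).
        const std::size_t block = 4096;
        for (std::size_t i = 0; i < block && i < data.size(); i += 16)
            _mm_prefetch(reinterpret_cast<const char*>(&data[i]), _MM_HINT_T0);

        // ...timed run of process(data) goes here (rdtsc or a high-resolution clock)...
    }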

But, of course, by far the best method of doing timings like this is to do them very many times on very large pieces of data such that the variability introduced by things you can't control is minimized. (It can never really be erased.)




Answer 6:


Most modern processors support a remarkable set of low-level hardware performance counters. If you really want to know the answers, including real measurements of cache misses and context-switch overhead, grab the PAPI (Performance API) toolkit; then, on some (though not all) OSes, install one kernel patch, and with some additional effort you're off and running.
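
A hedged sketch of what reading counters through PAPI's low-level API can look like (link with -lpapi). The preset events PAPI_TOT_CYC and PAPI_L1_DCM may be unavailable on some CPUs, and on recent Linux kernels PAPI sits on top of perf_event, so the kernel patch mentioned above is often no longer needed:

    #include <cstdio>
    #include <papi.h>

    // Placeholder for the function being measured.
    static void function_under_test() {
        volatile int sink = 0;
        for (int i = 0; i < 100000; ++i) sink = sink + i;
    }

    int main() {
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
            std::fprintf(stderr, "PAPI init failed\n");
            return 1;
        }

        int eventset = PAPI_NULL;
        if (PAPI_create_eventset(&eventset) != PAPI_OK ||
            PAPI_add_event(eventset, PAPI_TOT_CYC) != PAPI_OK ||   // total cycles
            PAPI_add_event(eventset, PAPI_L1_DCM) != PAPI_OK) {    // L1 data-cache misses
            std::fprintf(stderr, "event setup failed\n");
            return 1;
        }

        long long counts[2] = {0, 0};
        PAPI_start(eventset);
        function_under_test();
        PAPI_stop(eventset, counts);

        std::printf("cycles: %lld, L1D misses: %lld\n", counts[0], counts[1]);
    }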



Source: https://stackoverflow.com/questions/505738/how-to-accurately-measure-clock-cycles-used-by-a-c-function
