Looking for an accurate way to micro benchmark small code paths written in C++ and running on Linux/OSX

Submitted by 核能气质少年 on 2019-12-04 07:06:57
osgx

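You can use the "rdtsc" processor instruction on x86/x86_64. On multicore systems, check for the "constant_tsc" capability in CPUID (listed in /proc/cpuinfo on Linux) - it indicates that the counter ticks at a constant rate regardless of dynamic frequency changes, so in practice all cores can be treated as using the same tick counter (the separate "nonstop_tsc" flag indicates that the counter also keeps running in sleep states).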
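A minimal sketch of checking for the flag from code on Linux, assuming the usual /proc/cpuinfo layout with a "flags" line of space-separated names. The helper names cpuinfo_has_flag and cpu_has_constant_tsc are made up for this illustration and are not part of any system API.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Returns true if `flag` appears as a whole word on a "flags" line of
// /proc/cpuinfo-style text. Hypothetical helper, not a system API.
bool cpuinfo_has_flag(const std::string& cpuinfo_text, const std::string& flag) {
    std::istringstream lines(cpuinfo_text);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.compare(0, 5, "flags") != 0) continue;  // only "flags : ..." lines
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::istringstream words(line.substr(colon + 1));
        std::string word;
        while (words >> word) {
            if (word == flag) return true;
        }
    }
    return false;
}

// Reads /proc/cpuinfo (Linux only) and looks for constant_tsc.
// Returns false if the file cannot be opened, e.g. on OS X.
bool cpu_has_constant_tsc() {
    std::ifstream in("/proc/cpuinfo");
    if (!in) return false;
    std::stringstream buf;
    buf << in.rdbuf();
    return cpuinfo_has_flag(buf.str(), "constant_tsc");
}
```

From a shell, grep constant_tsc /proc/cpuinfo gives the same answer without any code.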

If your processor does not support constant_tsc, be sure to bind your program to a single core (the taskset utility on Linux).
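Binding can also be done from inside the program. The following is a rough sketch, assuming Linux with glibc (sched_setaffinity and sched_getcpu are not available on OS X); the function name pin_to_current_cpu is hypothetical.

```cpp
#include <sched.h>  // sched_getcpu, sched_setaffinity, cpu_set_t (Linux/glibc)

// Pins the calling thread to the core it is currently running on.
// Returns the core number on success, or -1 on failure and on non-Linux systems.
int pin_to_current_cpu() {
#ifdef __linux__
    const int cpu = sched_getcpu();
    if (cpu < 0) return -1;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // A pid of 0 means "the calling thread".
    if (sched_setaffinity(0, sizeof(set), &set) != 0) return -1;
    return cpu;
#else
    return -1;  // assumption: no equivalent is attempted on other systems
#endif
}
```

With taskset the same effect is achieved without touching the code, e.g. taskset -c 0 ./bench, where ./bench stands for your own benchmark binary.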

When using rdtsc on out-of-order CPUs (everything except Intel Atom and perhaps some other low-end CPUs), add an "ordering" instruction before it, e.g. "cpuid" - it will temporarily prevent instruction reordering.

Also, Mac OS X has "Shark", which can measure some hardware events in your code.

RDTSC and out-of-order CPUs: more info is in section 18 of Agner Fog's great second manual on optimization, "Optimizing subroutines in assembly language: An optimization guide for x86 platforms" (the main site with all five manuals is http://www.agner.org/optimize/)

http://www.scribd.com/doc/1548519/optimizing-assembly

On all processors with out-of-order execution, you have to insert XOR EAX,EAX / CPUID before and after each read of the counter in order to prevent it from executing in parallel with anything else. CPUID is a serializing instruction, which means that it flushes the pipeline and waits for all pending operations to finish before proceeding. This is very useful for testing purposes.
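The quoted advice could look roughly like this with GCC or Clang inline assembly. This is a sketch, not code from the manual: the names serialize, serialized_ticks, sum_to and ticks_for_sum are invented here, and the steady_clock fallback for non-x86 targets is only there so the example still compiles elsewhere.

```cpp
#include <chrono>
#include <cstdint>

// Executes CPUID with EAX = 0 as a serializing instruction (x86/x86_64 only).
inline void serialize() {
#if defined(__x86_64__) || defined(__i386__)
    std::uint32_t eax, ebx, ecx, edx;
    asm volatile("cpuid"
                 : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                 : "a"(0u)
                 : "memory");
#endif
}

// Reads the time stamp counter with CPUID before and after the read.
// On other architectures this sketch falls back to a steady_clock tick count.
inline std::uint64_t serialized_ticks() {
#if defined(__x86_64__) || defined(__i386__)
    serialize();
    std::uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    serialize();
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Hypothetical workload, only here to have something to time.
inline std::uint64_t sum_to(std::uint64_t n) {
    volatile std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < n; ++i) acc = acc + i;
    return acc;
}

// Returns the number of ticks that elapsed while running the workload.
inline std::uint64_t ticks_for_sum(std::uint64_t n) {
    const std::uint64_t start = serialized_ticks();
    sum_to(n);
    const std::uint64_t stop = serialized_ticks();
    return stop - start;
}
```

CPUID itself takes a noticeable number of cycles, so for very short code paths it is worth measuring an empty region first and subtracting that overhead.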

This is what I've used in the past:

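#include <sys/time.h>  // gettimeofday, timeval

// Returns the current wall-clock time in seconds, with microsecond resolution.
inline double gettime ()
{
    timeval tv;
    gettimeofday (&tv, NULL);
    return double (tv.tv_sec) + 0.000001 * tv.tv_usec;
}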

And then:

double startTime = gettime();
// your code here
double runTime = gettime() - startTime;

This reports times with microsecond resolution.
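Note that gettimeofday reads the wall clock, which can jump if the system time is adjusted during a run. As an alternative (not what this answer used), here is a small sketch with the monotonic std::chrono::steady_clock from C++11; the helper name time_seconds is made up for the example.

```cpp
#include <chrono>

// Runs `work` once and returns the elapsed time in seconds,
// measured with the monotonic std::chrono::steady_clock.
template <typename Work>
double time_seconds(Work work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

It is called with any callable, for example time_seconds([] { /* your code here */ }). The actual resolution of steady_clock depends on the platform and standard library.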

Cachegrind / kCachegrind are good for very fine-grained profiling. I don't believe they're available for OS X, but the results you get on Linux should be representative.

A microbenchmark should run the same code in a loop, preferably over many iterations. I used the following program and ran it under the time(1) utility.

I observed the following caveats:

  • If the test does not produce a result that is printed out, the code is eliminated by the optimizer - gcc with -O3 does that.

  • The test functions init() and lookup() must be implemented in a different source file from the iteration loop; if they are in the same file and lookup() returns a constant value, the optimizer will not call it even once and will just multiply the return value by the number of iterations!

file main.c

#include <stdio.h>

#define RUN_COUNT 10000000

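/* Both functions are defined in a separate source file. */
void init(void);
int  lookup(void);

int main(void)
{
  int sum = 0;
  int i;

  init();

  /* Accumulate the results so the calls cannot be optimized away. */
  for(i = 0; i < RUN_COUNT; i++ ) {
    sum  += lookup();
  }

  /* Print the sum so the computation has an observable result. */
  printf("%d\n", sum );
  return 0;
}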
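If splitting the code across files is inconvenient, the two caveats above can be approximated in a single file. This is only a sketch under some assumptions: __attribute__((noinline)) and the empty asm statement are GCC/Clang extensions, and lookup_stub, g_sink and run_loop are hypothetical names standing in for the real lookup() and loop.

```cpp
// Hypothetical stand-in for lookup(). noinline keeps the call out of line,
// and the empty asm statement is a side effect that stops the optimizer
// from folding the repeated calls into a single multiplication.
__attribute__((noinline)) int lookup_stub() {
    asm volatile("");
    return 3;
}

// A volatile sink: storing the result here makes it observable,
// so the loop cannot be discarded as dead code.
volatile int g_sink = 0;

int run_loop(int iterations) {
    int sum = 0;
    for (int i = 0; i < iterations; ++i) {
        sum += lookup_stub();
    }
    g_sink = sum;
    return sum;
}
```

It is still worth inspecting the generated assembly (for example with gcc -S) to confirm that the loop and the call survive at the optimization level you benchmark with.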