I am using rdtsc and cpuid instructions (using volatile inline assembly instructions) to measure the CPU cycles of a program. The rdtsc instruction gives realistic results for m
I don't know whether it is (or ever was) correct, but the code I once used was:
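typedef unsigned long long Ull;
/* RDTSC returns the low 32 bits in EAX and the high 32 bits in EDX.
   "=A" only means the EDX:EAX pair on 32-bit x86, so read both halves
   explicitly; this also works on x86-64. */
#define rdtscll(val) do { \
    unsigned int lo_, hi_; \
    __asm__ __volatile__("rdtsc" : "=a" (lo_), "=d" (hi_)); \
    (val) = ((Ull)hi_ << 32) | lo_; \
} while (0)

static inline Ull myget_cycles(void)
{
    Ull ret;
    rdtscll(ret);
    return ret;
}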
I remember it was "slower" on Intel than on AMD. YMMV.
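Since the question mentions pairing cpuid with rdtsc, here is a rough sketch of how that combination might look. It assumes GCC or Clang on x86 (ideally x86-64; clobbering ebx can be a problem in 32-bit PIC builds). The names serialized_cycles and measurement_overhead are made up for illustration and are not from any library. Treat it as a starting point, not a verified benchmark harness.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original post): run CPUID as a
   serializing barrier, then read the time-stamp counter. */
static inline uint64_t serialized_cycles(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "cpuid\n\t"   /* leaf 0; earlier instructions must complete first */
        "rdtsc"       /* EDX:EAX = time-stamp counter */
        : "=a" (lo), "=d" (hi)
        : "a" (0)
        : "ebx", "ecx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: smallest back-to-back delta over `runs` samples,
   an estimate of the measurement overhead to subtract from results.
   Returns UINT64_MAX if runs <= 0. */
static uint64_t measurement_overhead(int runs)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t start = serialized_cycles();
        uint64_t stop = serialized_cycles();
        if (stop - start < best)
            best = stop - start;
    }
    return best;
}
```

To time something, call serialized_cycles() before and after the code under test, subtract the two values, and then subtract measurement_overhead(1000) (or some similar number of runs). Taking the minimum over many runs is one way to filter out interrupts and cache misses; on a multi-core machine you would probably also want to pin the thread to one core.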