Consider the following code segment:
#include
#include
#include
#define ARRAYSIZE(arr) (sizeof(arr)/sizeof(arr[
Your code does almost nothing in func, and the little it does gets inlined into test and is probably optimized out, since you never use the return value.
gcc -O3 gives me:
0000000000400620 :
400620: 53 push %rbx
400621: 0f a2 cpuid
400623: 0f 31 rdtsc
400625: 48 89 d7 mov %rdx,%rdi
400628: 48 89 c6 mov %rax,%rsi
40062b: 0f a2 cpuid
40062d: 0f 31 rdtsc
40062f: 5b pop %rbx
...
So you're measuring the time for the two mov instructions, which are very cheap in hardware. Your measurement is probably showing the latency of cpuid, which is relatively expensive.
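One way to make the measurement mean something is to keep the call from being inlined, consume its result, and subtract the cost of the timing pair itself. The following is only a sketch under several assumptions: it needs GCC or Clang on x86, it uses lfence/rdtscp instead of the cpuid/rdtsc pair from the question, and func here is a hypothetical stand-in (the original body is not shown). The helper names tsc_begin, tsc_end, timing_overhead, time_func and sink are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang, x86 only) */

/* Hypothetical stand-in for the question's func: sums an array.
 * noinline keeps it a real call, so there is something to time. */
__attribute__((noinline))
static uint64_t func(const uint32_t *data, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* A store to a volatile object is an observable side effect, so the
 * compiler cannot drop the computation that produces the stored value. */
static volatile uint64_t sink;

/* Read the TSC with fences on both sides so the read is ordered
 * against the surrounding instructions. */
static inline uint64_t tsc_begin(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* rdtscp waits for earlier instructions to execute before reading the
 * TSC; the trailing lfence keeps later instructions from starting early. */
static inline uint64_t tsc_end(void)
{
    unsigned int aux;
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Smallest cost observed for an empty begin/end pair over `trials` runs. */
static uint64_t timing_overhead(int trials)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        uint64_t start = tsc_begin();
        uint64_t stop = tsc_end();
        if (stop - start < best)
            best = stop - start;
    }
    return best;
}

/* Cycles for one call to func, minus the timing overhead (clamped at 0).
 * The store to sink is inside the timed region and adds a small cost. */
static uint64_t time_func(const uint32_t *data, size_t n, uint64_t overhead)
{
    uint64_t start = tsc_begin();
    sink = func(data, n);
    uint64_t stop = tsc_end();
    uint64_t delta = stop - start;
    return delta > overhead ? delta - overhead : 0;
}
```

A caller would compute overhead once with something like timing_overhead(1000), then run time_func many times and look at the minimum or median rather than a single sample. For a function this small the result will likely still be dominated by noise, which is the point of the paragraph above.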
Worse, your clflush would actually flush test as well. This means you pay the re-fetch penalty the next time you access it, which happens outside the rdtsc pair, so it's not measured. The measured code, on the other hand, follows sequentially, so fetching test would probably also fetch the flushed code you measure, and it could actually be cached again by the time you measure it.
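If the goal is to see the cost of a cache miss, it may be easier to flush a data line and time a load from it, so that the flushed line is touched inside the timed region rather than before it. This is a rough sketch, not the code from the question: it assumes GCC or Clang on x86, and time_load and time_flushed_load are hypothetical helper names.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence, _mm_mfence, _mm_clflush */

/* Cycles for a single byte load from *p, bracketed by fenced TSC reads.
 * The pointer is volatile so the load cannot be optimized away. */
static uint64_t time_load(const volatile uint8_t *p)
{
    unsigned int aux;
    _mm_lfence();
    uint64_t start = __rdtsc();
    _mm_lfence();
    uint8_t value = *p;              /* the access being measured */
    uint64_t stop = __rdtscp(&aux);
    _mm_lfence();
    (void)value;
    return stop - start;
}

/* Flush the cache line holding *p, wait for the flush to complete,
 * then time the reload. The miss now falls inside the timed region. */
static uint64_t time_flushed_load(const volatile uint8_t *p)
{
    _mm_clflush((const void *)p);
    _mm_mfence();
    return time_load(p);
}
```

Calling time_flushed_load(buf) followed by time_load(buf) on the same buffer would typically show a much larger number for the first call, though the exact values depend on the machine, and both numbers still include the overhead of the timing instructions.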