#include
static inline unsigned long long tick()
{
unsigned long long d;
__asm__ __volatile__ ("rdtsc" : "=A" (d) );
ret
It looks like your OS has disabled execution of RDTSC in user space. Every RDTSC your application executes then traps into the kernel and back, which costs a lot of cycles.
This is from the Intel Software Developer’s Manual:
When in protected or virtual 8086 mode, the time stamp disable (TSD) flag in register CR4 restricts the use of the RDTSC instruction as follows. When the TSD flag is clear, the RDTSC instruction can be executed at any privilege level; when the flag is set, the instruction can only be executed at privilege level 0. (When in real-address mode, the RDTSC instruction is always enabled.)
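User code cannot read CR4 directly, but on Linux the per-process TSC setting can be queried with `prctl(PR_GET_TSC, ...)`. The sketch below assumes Linux on x86; the helper names `tsc_mode` and `tsc_mode_name` are made up for illustration. Note that in the `PR_TSC_SIGSEGV` mode Linux delivers SIGSEGV on RDTSC rather than emulating it, so this check only covers one way the flag can end up set. Another OS or a hypervisor may trap and emulate the instruction instead.

```c
#include <sys/prctl.h>

/* Hypothetical helper: returns PR_TSC_ENABLE, PR_TSC_SIGSEGV,
 * or -1 if the kernel/architecture does not support the query. */
static int tsc_mode(void)
{
    int mode = 0;
    if (prctl(PR_GET_TSC, &mode) != 0)
        return -1;
    return mode;
}

/* Hypothetical helper: human-readable name for the value above. */
static const char *tsc_mode_name(int mode)
{
    switch (mode) {
    case PR_TSC_ENABLE:  return "RDTSC allowed in user mode";
    case PR_TSC_SIGSEGV: return "RDTSC raises SIGSEGV in user mode";
    default:             return "unknown (query not supported)";
    }
}
```

Printing `tsc_mode_name(tsc_mode())` at program start is enough to rule this particular mechanism in or out.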
Edit:
In response to aix's comment, here is why TSD is most likely the reason.
I know of only these reasons for a program to take much longer than usual to execute a single instruction:
The first two reasons usually cannot delay execution by more than a few hundred cycles. 2000-2500 cycles is more typical of a context switch or a kernel switch. But it is practically impossible to hit a context switch several times in a row at the same place, so it must be a kernel switch. That means either the program is running under a debugger or RDTSC is not allowed in user mode.
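One way to test this reasoning is to time back-to-back RDTSC reads many times and keep the minimum. An occasional context switch or interrupt inflates only some samples, while a trap on every RDTSC inflates all of them, including the minimum. The following is a minimal sketch for GCC or Clang on x86 only; `read_tsc` and `min_rdtsc_delta` are names chosen here, not taken from the question. It reads EDX:EAX through separate `"=a"` and `"=d"` outputs because the `"=A"` constraint does not denote the register pair on x86-64.

```c
#include <stdint.h>

/* Read the time stamp counter; RDTSC returns it in EDX:EAX. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Smallest observed distance between two consecutive RDTSC reads.
 * Returns UINT64_MAX if samples <= 0. */
static uint64_t min_rdtsc_delta(int samples)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < samples; i++) {
        uint64_t t0 = read_tsc();
        uint64_t t1 = read_tsc();
        uint64_t delta = t1 - t0;
        if (delta < best)
            best = delta;   /* outliers from interrupts are discarded */
    }
    return best;
}
```

If `min_rdtsc_delta(1000)` comes out in the thousands rather than in the tens of cycles, that is consistent with every RDTSC being trapped, though it does not by itself tell a debugger, a hypervisor, and the TSD flag apart.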
The most likely reason for an OS to disable RDTSC is security: there have been attempts to use RDTSC for timing attacks against encryption programs.