I am writing C code to measure the number of clock cycles needed to acquire a semaphore. I am using rdtsc, and before doing the measurement on the semaphore, I call rdtsc twice in a row to measure its own overhead, repeating this many times in a loop and using the average value as the rdtsc overhead.
Do not use the average value.
Use the smallest value instead, or an average of only the smaller values (averaging a few of the small ones smooths out cache effects), because the bigger ones have been interrupted by OS multitasking.
You could also record all the values, find the OS scheduling-granularity boundary, and filter out everything beyond it (usually > 1 ms, which is easily detectable).
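For example, a minimal sketch of the take-the-minimum approach (get_cycles() here stands for whatever rdtsc wrapper you use; the sample count is arbitrary):

#include <stdint.h>

/* returns the smallest of many measurements; get_cycles() is your rdtsc wrapper */
static uint64_t measure_min(void)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < 10000; i++) {
        uint64_t start = get_cycles();
        /* ... the operation being measured, e.g. sem_wait()/sem_post() ... */
        uint64_t diff = get_cycles() - start;
        if (diff < best)            /* the big samples were interrupted by the OS */
            best = diff;
    }
    return best;
}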
There is no need to measure the overhead of RDTSC itself: both readings are offset by the same amount, so the offset disappears after subtraction.
For a variable-clock RDTSC source (like on laptops), you should first push the CPU to its maximum speed with some steady, compute-intensive loop; a few seconds are usually enough. Measure the CPU frequency continuously and start measuring your thing only when it is stable enough.
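A crude sketch of that warm-up idea (the loop body is arbitrary busy-work and the two-second duration is just a placeholder):

#include <time.h>

/* spin on steady work so a laptop CPU ramps up to, and stays at, full speed */
static void warm_up_cpu(void)
{
    volatile unsigned long sink = 1;
    time_t stop = time(NULL) + 2;   /* roughly two seconds of busy-work */
    while (time(NULL) < stop)
        sink = sink * 3 + 1;
}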
In the face of thermal and idle throttling, mouse-motion and network traffic interrupts, whatever it's doing with the GPU, and all the other overhead that a modern multicore system can absorb without anyone much caring, I think your only reasonable course for this is to accumulate a few thousand individual samples and just toss the outliers before taking the median or mean (not a statistician but I'll venture it won't make much difference here).
I'd think anything you do to eliminate the noise of a running system will skew the results much worse than just accepting that there's no way you'll ever be able to reliably predict how long it'll take anything to complete these days.
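If you go that route, a rough sketch of the trimming step (names are illustrative; the samples are assumed to already be collected into a plain uint64_t array, and qsort comes from <stdlib.h>):

#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* sort, drop the largest few percent as outliers, return the median of the rest */
static uint64_t trimmed_median(uint64_t *samples, size_t n)
{
    qsort(samples, n, sizeof samples[0], cmp_u64);
    size_t kept = n - n / 20;       /* the 5% cut-off is arbitrary */
    return samples[kept / 2];
}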
If your code starts off on one processor and then swaps to another, the timestamp difference may be negative due to processors sleeping, etc.
Try setting the processor affinity before you start measuring.
I can't see if you are running under Windows or Linux from the question, so I'll answer for both.
Windows:
DWORD_PTR affinityMask = 0x00000001;
SetProcessAffinityMask(GetCurrentProcess(), affinityMask);
Linux:
#define _GNU_SOURCE          /* must appear before any #include */
#include <sched.h>
#include <unistd.h>

cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset);
sched_setaffinity(getpid(), sizeof(cpuset), &cpuset);
The other answers are great (go read them), but they assume that rdtsc is being read correctly. This answer addresses the inline-asm bug that leads to totally bogus results, including negative ones.
The other possibility is that you were compiling this as 32-bit code, but with many more repeats, and got an occasional negative interval on CPU migration on a system that doesn't have an invariant TSC (synced TSCs across all cores): either a multi-socket system, or an older multi-core one. See CPU TSC fetch operation especially in multicore-multi-processor environment.
If you were compiling for x86-64, your negative results are fully explained by your incorrect "=A" output constraint for asm. See Get CPU cycle count? for correct ways to use rdtsc that are portable to all compilers and to 32- vs. 64-bit mode. (Or use "=a" and "=d" outputs and simply ignore the high-half output, for short intervals that won't overflow 32 bits.)
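For reference, a minimal sketch of that "=a"/"=d" variant (GNU C inline asm; just one of the correct options described in the linked Q&A):

#include <stdint.h>

static inline uint64_t rdtsc_u64(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));   /* rdtsc writes EDX:EAX */
    return ((uint64_t)hi << 32) | lo;   /* or keep just lo for intervals under 2^32 */
}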
(I'm surprised you didn't mention them also being huge and wildly varying, as well as overflowing tot to give a negative average even if no individual measurements were negative. I'm seeing averages like -63421899, or 69374170, or 115365476.)
Compiling it with gcc -O3 -m32 makes it work as expected, printing averages of 24 to 26 (if run in a loop so the CPU stays at top speed; otherwise more like 125 reference cycles for the 24 core clock cycles between back-to-back rdtsc instructions on Skylake). See https://agner.org/optimize/ for instruction tables.
"=A"
constraintrdtsc (insn ref manual entry) always produces the two 32-bit hi:lo
halves of its 64-bit result in edx:eax
, even in 64-bit mode where we're really rather have it in a single 64-bit register.
You were expecting the "=A" output constraint to pick edx:eax for uint64_t t. But that's not what happens. For a variable that fits in one register, the compiler picks either RAX or RDX and assumes the other is unmodified, just like a "=r" constraint picks one register and assumes the rest are unmodified. Or a "=Q" constraint picks one of a, b, c, or d. (See x86 constraints.)
In x86-64, you'd normally only want "=A" for an unsigned __int128 operand, like a multiply result or a div input. It's kind of a hack because using %0 in the asm template only expands to the low register, and there's no warning when "=A" doesn't use both the a and d registers.
To see exactly how this causes a problem, I added a comment inside the asm template: __asm__ volatile ("rdtsc # compiler picked %0" : "=A"(t));. That way we can see what the compiler expects, based on what we told it with the operands.
The resulting loop (in Intel syntax) looks like this, from compiling a cleaned up version of your code on the Godbolt compiler explorer for 64-bit gcc and 32-bit clang:
# the main loop from gcc -O3 targeting x86-64, my comments added
.L6:
rdtsc # compiler picked rax # c1 = rax
rdtsc # compiler picked rdx # c2 = rdx, not realizing that rdtsc clobbers rax(c1)
# compiler thinks RAX=c1, RDX=c2
# actual situation: RAX=low half of c2, RDX=high half of c2
sub edx, eax # tsccost = edx-eax
js .L3 # jump if the sign-bit is set in tsccost
... rest of loop back to .L6
When the compiler is calculating c2-c1, it's actually calculating hi-lo from the 2nd rdtsc, because we lied to the compiler about what the asm statement does. The 2nd rdtsc clobbered c1. We told it that it had a choice of which register to get the output in, so it picked one register the first time and the other register the 2nd time, so it wouldn't need any mov instructions.
The TSC counts reference cycles since the last reboot. But the code doesn't depend on hi<lo, it just depends on the sign of hi-lo. Since lo wraps around every second or two (2^32 Hz is close to 4.3 GHz), running the program at any given time has approximately a 50% chance of seeing a negative result. It doesn't depend on the current value of hi; there's maybe a 1 part in 2^32 bias in one direction or the other, because hi changes by one when lo wraps around.
Since hi-lo is a nearly uniformly distributed 32-bit integer, overflow of the average is very common. Your code would be ok if the average were normally small. (But see the other answers for why you don't want the mean; you want the median or something similar that excludes outliers.)
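If you'd rather avoid inline asm entirely, here is a minimal sketch using the compilers' __rdtsc() intrinsic (available via <x86intrin.h> on GCC/clang and <intrin.h> on MSVC; the wrapper name is just illustrative):

#include <stdint.h>
#include <x86intrin.h>   /* GCC/clang; MSVC uses <intrin.h> */

static inline uint64_t get_cycles_portable(void)
{
    return __rdtsc();    /* the full 64-bit TSC value, no output-constraint pitfalls */
}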
The principal point of my question was not the accuracy of the result, but the fact that I am getting negative values every now and then (the first call to rdtsc gives a bigger value than the second call). Doing more research (and reading other questions on this website), I found out that one way to make things work when using rdtsc is to put a cpuid instruction just before it. This instruction serializes the code. This is how I am doing things now:
static inline uint64_t get_cycles()
{
    uint64_t t;
    volatile int dont_remove __attribute__((unused));
    unsigned tmp;
    __asm volatile ("cpuid" : "=a"(tmp), "=b"(tmp), "=c"(tmp), "=d"(tmp)
                            : "a" (0));
    dont_remove = tmp;
    __asm volatile ("rdtsc" : "=A"(t));
    return t;
}
I am still getting a NEGATIVE difference between the second call and the first call of the get_cycles function. WHY? I am not 100% sure about the syntax of the cpuid inline assembly; this is what I found by looking around on the internet.
I tested your code on my machine and concluded that only the low uint32_t half of the RDTSC value is usable. I do the following in my code to correct for the wrap-around:
if (after_t < before_t) diff_t = after_t + 4294967296ULL - before_t;   /* the low 32 bits wrapped around */
else diff_t = after_t - before_t;