Negative clock cycle measurements with back-to-back rdtsc?


Question


I am writing a C code for measuring the number of clock cycles needed to acquire a semaphore. I am using rdtsc, and before doing the measurement on the semaphore, I call rdtsc two consecutive times, to measure the overhead. I repeat this many times, in a for-loop, and then I use the average value as rdtsc overhead.

Is this correct, to use the average value, first of all?

Nonetheless, the big problem here is that sometimes I get negative values for the overhead (not necessarily the averaged one, but at least some of the individual measurements inside the for loop).

This also affects the subsequent calculation of the number of CPU cycles needed for the sem_wait() operation, which sometimes also turns out to be negative. In case what I wrote is not clear, here is the part of the code I am working on.

Why am I getting such negative values?


(editor's note: see Get CPU cycle count? for a correct and portable way of getting the full 64-bit timestamp. An "=A" asm constraint will only get the low or high 32 bits when compiled for x86-64, depending on whether register allocation happens to pick RAX or RDX for the uint64_t output. It won't pick edx:eax.)

(editor's 2nd note: oops, that's the answer to why we're getting negative results. Still worth leaving a note here as a warning not to copy this rdtsc implementation.)


#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>

static inline uint64_t get_cycles()
{
  uint64_t t;
           // editor's note: "=A" is unsafe for this in x86-64
  __asm volatile ("rdtsc" : "=A"(t));
  return t;
}

int num_measures = 10;

int main ()
{
   int i, value, res1, res2;
   uint64_t c1, c2;
   int tsccost, tot, a;

   tot=0;    

   for(i=0; i<num_measures; i++)
   {    
      c1 = get_cycles();
      c2 = get_cycles();

      tsccost=(int)(c2-c1);


      if(tsccost<0)
      {
         printf("####  ERROR!!!   ");
         printf("rdtsc took %d clock cycles\n", tsccost);
         return 1;
      }   
      tot = tot+tsccost;
   }

   tsccost=tot/num_measures;
   printf("rdtsc takes on average: %d clock cycles\n", tsccost);      

   return EXIT_SUCCESS;
}

Answer 1:


When Intel first invented the TSC it measured CPU cycles. Due to various power management features "cycles per second" is not constant; so TSC was originally good for measuring the performance of code (and bad for measuring time passed).

For better or worse; back then CPUs didn't really have much power management, and often ran at a fixed "cycles per second" anyway. Some programmers got the wrong idea and misused the TSC for measuring time and not cycles. Later (when the use of power management features became more common) these people misusing TSC to measure time whined about all the problems that their misuse caused. CPU manufacturers (starting with AMD) changed TSC so it measures time and not cycles (making it broken for measuring the performance of code, but correct for measuring time passed). This caused confusion (it was hard for software to determine what TSC actually measured), so a little later on AMD added the "TSC Invariant" flag to CPUID, so that if this flag is set programmers know that the TSC is broken (for measuring cycles) or fixed (for measuring time).

Intel followed AMD and changed the behaviour of their TSC to also measure time, and also adopted AMD's "TSC Invariant" flag.

This gives 4 different cases:

  • TSC measures both time and performance (cycles per second is constant)

  • TSC measures performance not time

  • TSC measures time and not performance but doesn't use the "TSC Invariant" flag to say so

  • TSC measures time and not performance and does use the "TSC Invariant" flag to say so (most modern CPUs)

For cases where TSC measures time, to measure performance/cycles properly you have to use performance monitoring counters. Sadly, performance monitoring counters are different for different CPUs (model specific) and require access to MSRs (privileged code). This makes it considerably impractical for applications to measure "cycles".

Also note that if the TSC does measure time, you can't know what time scale it returns (how many nanoseconds in a "pretend cycle") without using some other time source to determine a scaling factor.

The second problem is that for multi-CPU systems most operating systems suck. The correct way for an OS to handle the TSC is to prevent applications from using it directly (by setting the TSD flag in CR4; so that the RDTSC instruction causes an exception). This prevents various security vulnerabilities (timing side-channels). It also allows the OS to emulate the TSC and ensure it returns a correct result. For example, when an application uses the RDTSC instruction and causes an exception, the OS's exception handler can figure out a correct "global time stamp" to return.

Of course different CPUs have their own TSC. This means that if an application uses TSC directly they get different values on different CPUs. To help people work around the OS's failure to fix the problem (by emulating RDTSC like they should); AMD added the RDTSCP instruction, which returns the TSC and a "processor ID" (Intel ended up adopting the RDTSCP instruction too). An application running on a broken OS can use the "processor ID" to detect when they're running on a different CPU from last time; and in this way (using the RDTSCP instruction) they can know when "elapsed = TSC - previous_TSC" gives an invalid result. However; the "processor ID" returned by this instruction is just a value in an MSR, and the OS has to set this value on each CPU to something different - otherwise RDTSCP will say that the "processor ID" is zero on all CPUs.

Basically; if the CPU supports the RDTSCP instruction, and if the OS has correctly set the "processor ID" (using the MSR); then the RDTSCP instruction can help applications know when they've got a bad "elapsed time" result (but it doesn't provide any way of fixing or avoiding the bad result).
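As a rough illustration (assuming GCC/Clang with <x86intrin.h>, and an OS that actually programs the IA32_TSC_AUX "processor ID" MSR; the helper name here is made up), an application can compare the "processor ID" from both readings and discard the sample if they differ:

#include <stdint.h>
#include <x86intrin.h>   // __rdtscp()

/* Returns 0 and stores the elapsed TSC ticks, or -1 if the thread
 * migrated between the two readings and the sample should be discarded. */
static int elapsed_tsc(uint64_t *elapsed)
{
   unsigned int aux_start, aux_end;
   uint64_t start = __rdtscp(&aux_start);   // also reads the "processor ID"
   /* ... code being measured ... */
   uint64_t end = __rdtscp(&aux_end);

   if (aux_start != aux_end)
      return -1;                            // ran on a different CPU; discard
   *elapsed = end - start;
   return 0;
}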

So; to cut a long story short, if you want an accurate performance measurement you're mostly screwed. The best you can realistically hope for is an accurate time measurement; but only in some cases (e.g. when running on a single-CPU machine or "pinned" to a specific CPU; or when using RDTSCP on OSs that set it up properly as long as you detect and discard invalid values).

Of course even then you'll get dodgy measurements because of things like IRQs. For this reason; it's best to run your code many times in a loop and discard any results that are too much higher than other results.

Finally, if you really want to do it properly you should measure the overhead of measuring. To do this you'd measure how long it takes to do nothing (just the RDTSC/RDTSCP instruction alone, while discarding dodgy measurements); then subtract the overhead of measuring from the "measuring something" results. This gives you a better estimate of the time "something" actually takes.
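For instance (a rough sketch of that idea, assuming a correct 64-bit reader such as __rdtsc() from <x86intrin.h>; the helper name is illustrative), you can take the smallest of many back-to-back readings as the overhead and subtract it later:

#include <stdint.h>
#include <x86intrin.h>   // __rdtsc()

/* Estimate the cost of the measurement itself: the minimum of many
 * back-to-back readings, which discards samples inflated by interrupts. */
static uint64_t tsc_overhead(void)
{
   uint64_t best = UINT64_MAX;
   for (int i = 0; i < 1000; i++) {
      uint64_t t0 = __rdtsc();
      uint64_t t1 = __rdtsc();
      if (t1 - t0 < best)
         best = t1 - t0;
   }
   return best;
}

/* Later:  measured = (end - start) - tsc_overhead(); */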

Note: If you can dig up a copy of Intel's System Programming Guide from when Pentium was first released (mid 1990s - not sure if it's available online anymore - I have archived copies since the 1980s) you'll find that Intel documented the time stamp counter as something that "can be used to monitor and identify the relative time of occurrence of processor events". They guaranteed that (excluding 64-bit wrap-around) it would monotonically increase (but not that it would increase at a fixed rate) and that it'd take a minimum of 10 years before it wrapped around. The latest revision of the manual documents the time stamp counter in more detail, stating that for older CPUs (P6, Pentium M, older Pentium 4) the time stamp counter "increments with every internal processor clock cycle" and that "Intel(r) SpeedStep(r) technology transitions may impact the processor clock"; and that for newer CPUs (newer Pentium 4, Core Solo, Core Duo, Core 2, Atom) the TSC increments at a constant rate (and that this is the "architectural behaviour moving forward"). Essentially, from the very beginning it was a (variable) "internal cycle counter" to be used for a time-stamp (and not a time counter to be used to track "wall clock" time), and this behaviour changed soon after the year 2000 (based on Pentium 4 release date).




Answer 2:


  1. Do not use the average value

    Use the smallest value, or the average of the smaller values, instead (you may still want an average because of cache effects), because the bigger values have been inflated by OS multitasking. See the sketch after this list.

    You could also record all the values, find the OS scheduling-granularity boundary, and filter out everything beyond it (usually > 1 ms, which is easy to detect).

  2. No need to measure the overhead of RDTSC

    Both readings are offset by the same amount, so the offset cancels out after subtraction.

  3. For a variable-rate RDTSC clock source (like on laptops)

    You should first drive the CPU to its maximum speed with a steady, intensive computation loop; usually a few seconds are enough. You should measure the CPU frequency continuously and start measuring your thing only once it is stable enough.
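A rough sketch of point 1 (plain C; the helper names are illustrative): collect all the samples first, then take the smallest one or the median instead of the mean, so samples inflated by interrupts or task switches are ignored.

#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
   uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
   return (x > y) - (x < y);
}

/* Sorts the samples and returns the median; samples[0] is then the minimum. */
static uint64_t median_sample(uint64_t *samples, size_t n)
{
   qsort(samples, n, sizeof(samples[0]), cmp_u64);
   return samples[n / 2];
}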




Answer 3:


If your code starts off on one processor then swaps to another, the timestamp difference may be negative due to processors sleeping etc.

Try setting the processor affinity before you start measuring.

I can't see if you are running under Windows or Linux from the question, so I'll answer for both.

Windows:

#include <windows.h>

DWORD_PTR affinityMask = 0x00000001;
SetProcessAffinityMask(GetCurrentProcess(), affinityMask);

Linux:

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset);
sched_setaffinity(getpid(), sizeof(cpuset), &cpuset);



Answer 4:


The other answers are great (go read them), but assume that rdtsc is being read correctly. This answer addresses the inline-asm bug that leads to totally bogus results, including negative ones.

The other possibility is that you were compiling this as 32-bit code, but with many more repeats, and got an occasional negative interval on CPU migration on a system that doesn't have invariant-TSC (synced TSCs across all cores) - either a multi-socket system, or an older multi-core. See CPU TSC fetch operation especially in multicore-multi-processor environment.


If you were compiling for x86-64, your negative results are fully explained by your incorrect "=A" output constraint for asm. See Get CPU cycle count? for correct ways to use rdtsc that are portable to all compilers and 32 vs. 64-bit mode. Or use "=a" and "=d" outputs and simply ignore the high half output, for short intervals that won't overflow 32 bits.
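A minimal sketch of the "=a"/"=d" version in GNU C (keeping both halves so the full 64-bit value is returned); this matches what the linked answer recommends, and is equivalent to the __rdtsc() intrinsic from <x86intrin.h>:

#include <stdint.h>

static inline uint64_t get_cycles_fixed(void)
{
   uint32_t lo, hi;
   __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));  // EDX:EAX as two separate outputs
   return ((uint64_t)hi << 32) | lo;
}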

(I'm surprised you didn't mention them also being huge and wildly-varying, as well as overflowing tot to give a negative average even if no individual measurements were negative. I'm seeing averages like -63421899, or 69374170, or 115365476.)

Compiling it with gcc -O3 -m32 makes it work as expected, printing averages of 24 to 26 (if run in a loop so the CPU stays at top speed, otherwise more like 125 reference cycles for the 24 core clock cycles between back-to-back rdtsc on Skylake). See https://agner.org/optimize/ for instruction tables.


Asm details of what went wrong with the "=A" constraint

rdtsc (insn ref manual entry) always produces the two 32-bit hi:lo halves of its 64-bit result in edx:eax, even in 64-bit mode where we'd really rather have it in a single 64-bit register.

You were expecting the "=A" output constraint to pick edx:eax for uint64_t t. But that's not what happens. For a variable that fits in one register, the compiler picks either RAX or RDX and assumes the other is unmodified, just like a "=r" constraint picks one register and assumes the rest are unmodified. Or an "=Q" constraint picks one of a,b,c, or d. (See x86 constraints).

In x86-64, you'd normally only want "=A" for an unsigned __int128 operand, like a multiply result or div input. It's kind of a hack because using %0 in the asm template only expands to the low register, and there's no warning when "=A" doesn't use both the a and d registers.

To see exactly how this causes a problem, I added a comment inside the asm template:
__asm__ volatile ("rdtsc # compiler picked %0" : "=A"(t));. So we can see what the compiler expects, based on what we told it with operands.

The resulting loop (in Intel syntax) looks like this, from compiling a cleaned up version of your code on the Godbolt compiler explorer for 64-bit gcc and 32-bit clang:

# the main loop from gcc -O3  targeting x86-64, my comments added
.L6:
    rdtsc  # compiler picked rax     # c1 = rax
    rdtsc  # compiler picked rdx     # c2 = rdx, not realizing that rdtsc clobbers rax(c1)

      # compiler thinks   RAX=c1,               RDX=c2
      # actual situation: RAX=low half of c2,   RDX=high half of c2

    sub     edx, eax                 # tsccost = edx-eax
    js      .L3                      # jump if the sign-bit is set in tsccost
   ... rest of loop back to .L6

When the compiler is calculating c2-c1, it's actually calculating hi-lo from the 2nd rdtsc, because we lied to the compiler about what the asm statement does. The 2nd rdtsc clobbered c1.

We told it that it had a choice of which register to get the output in, so it picked one register the first time, and the other the 2nd time, so it wouldn't need any mov instructions.

The TSC counts reference cycles since the last reboot. But the code doesn't depend on hi<lo, it just depends on the sign of hi-lo. Since lo wraps around every second or two (2^32 Hz is close to 4.3GHz), running the program at any given time has approximately a 50% chance of seeing a negative result.

It doesn't depend on the current value of hi; there's maybe a 1 part in 2^32 bias in one direction or the other because hi changes by one when lo wraps around.

Since hi-lo is a nearly uniformly distributed 32-bit integer, overflow of the average is very common. Your code would be OK if the average were normally small. (But see other answers for why you don't want the mean; you want the median or something similar, to exclude outliers.)




Answer 5:


The principal point of my question was not the accuracy of the result, but the fact that I am getting negative values every now and then (the first call to rdtsc gives a bigger value than the second call). Doing more research (and reading other questions on this website), I found out that a way to get things working when using rdtsc is to put a cpuid instruction just before it. This instruction serializes the code. This is how I am doing things now:

static inline uint64_t get_cycles()
{
  uint64_t t;

  volatile int dont_remove __attribute__((unused));
  unsigned tmp;
  __asm volatile ("cpuid" : "=a"(tmp), "=b"(tmp), "=c"(tmp), "=d"(tmp)
                  : "a" (0));
  dont_remove = tmp;

  __asm volatile ("rdtsc" : "=A"(t));
  return t;
}

I am still getting a NEGATIVE difference between the second call and the first call of the get_cycles function. WHY? I am not 100% sure about the syntax of the cpuid inline assembly; this is what I found looking on the internet.
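(Editor's note: the remaining problem is still the "=A" constraint discussed in the previous answer, not the serialization. A hedged sketch of the same idea with that constraint removed, assuming GNU C: the cpuid serialization and the TSC read are combined into one asm statement with proper clobbers, and the result is assembled from EDX:EAX.)

#include <stdint.h>

static inline uint64_t get_cycles_serialized(void)
{
  uint32_t lo, hi;
  __asm__ volatile ("cpuid\n\t"            // serialize: wait for earlier instructions to finish
                    "rdtsc"
                    : "=a"(lo), "=d"(hi)
                    : "a"(0)               // CPUID leaf 0
                    : "ebx", "ecx");       // CPUID also writes EBX and ECX
  return ((uint64_t)hi << 32) | lo;
}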




Answer 6:


In the face of thermal and idle throttling, mouse-motion and network traffic interrupts, whatever it's doing with the GPU, and all the other overhead that a modern multicore system can absorb without anyone much caring, I think your only reasonable course for this is to accumulate a few thousand individual samples and just toss the outliers before taking the median or mean (not a statistician but I'll venture it won't make much difference here).

I'd think anything you do to eliminate the noise of a running system will skew the results much worse than just accepting that there's no way you'll ever be able to reliably predict how long it'll take anything to complete these days.




Answer 7:


rdtsc can be used to get a reliable and very precise elapsed time. If using Linux you can see if your processor supports a constant rate TSC by looking in /proc/cpuinfo to see if you have constant_tsc defined.

Make sure that you stay on the same core; every core has its own TSC with its own value. To use rdtsc, use taskset, or SetThreadAffinityMask (Windows), or pthread_setaffinity_np to ensure that your process stays on the same core, e.g. as sketched below.
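A minimal sketch of the pthread_setaffinity_np variant on Linux (compile with -pthread; pinning to core 0 here is just an example):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_to_core0(void)
{
   cpu_set_t set;
   CPU_ZERO(&set);
   CPU_SET(0, &set);   // pick whichever core you want to pin to
   return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}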

Then you divide the tick difference by your main clock rate, which on Linux can be found in /proc/cpuinfo, or you can determine it at runtime by:

rdtsc
clock_gettime
sleep for 1 second
clock_gettime
rdtsc

then see how many ticks occur per second, and then divide any difference in ticks by that rate to find out how much time has elapsed.
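A rough sketch of that calibration on Linux (assuming GCC/Clang with <x86intrin.h>; CLOCK_MONOTONIC is used for the reference interval):

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <x86intrin.h>   // __rdtsc()

int main(void)
{
   struct timespec t0, t1;

   uint64_t tsc0 = __rdtsc();
   clock_gettime(CLOCK_MONOTONIC, &t0);
   sleep(1);                                  // let roughly one second pass
   clock_gettime(CLOCK_MONOTONIC, &t1);
   uint64_t tsc1 = __rdtsc();

   double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
   double ticks_per_sec = (double)(tsc1 - tsc0) / secs;

   printf("TSC rate: about %.0f ticks per second\n", ticks_per_sec);
   return 0;
}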




Answer 8:


If the thread that is running your code is moving between cores then it's possible that the rdtsc value returned is less than the value read on another core. The cores don't all set the counter to 0 at exactly the same time when the package powers up. So make sure you set thread affinity to a specific core when you run your test.




Answer 9:


I tested your code on my machine and found that with this RDTSC function only the low 32 bits (a uint32_t) of the result are meaningful.

I do the following in my code to correct it:

// handle wrap-around of the 32-bit value (before_t read first, after_t second)
if (after_t < before_t) { diff_t = after_t + 4294967296ULL - before_t; }
else                    { diff_t = after_t - before_t; }


Source: https://stackoverflow.com/questions/19941588/negative-clock-cycle-measurements-with-back-to-back-rdtsc
