Negative clock cycle measurements with back-to-back rdtsc?

情话喂你 2020-11-27 04:17

I am writing a C code for measuring the number of clock cycles needed to acquire a semaphore. I am using rdtsc, and before doing the measurement on the semaphore, I call rdt

9 answers
  • 2020-11-27 04:39

    If the thread running your code migrates between cores, an rdtsc value read on one core can be lower than a value read earlier on another core. The cores don't all set their counters to 0 at exactly the same time when the package powers up. So make sure you set thread affinity to a specific core when you run your test.
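
    A minimal sketch of pinning on Linux, assuming glibc and its `sched_setaffinity` wrapper. The helper name `pin_to_cpu` is made up for illustration, and other platforms need a different call.

    ```c
    #define _GNU_SOURCE          /* must come before any #include */
    #include <assert.h>
    #include <sched.h>

    /* Hypothetical helper: restrict the calling thread to a single CPU so
     * that every rdtsc read comes from the same core's counter.
     * Returns 0 on success, -1 on failure (errno is set by the kernel). */
    static int pin_to_cpu(int cpu)
    {
        cpu_set_t set;

        /* Passing an out-of-range index is treated as a programming error. */
        assert(cpu >= 0 && cpu < CPU_SETSIZE);

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);

        /* A pid of 0 means "the calling thread". */
        return sched_setaffinity(0, sizeof(set), &set);
    }
    ```

    Calling `pin_to_cpu(sched_getcpu())` before the first rdtsc keeps the thread on whichever core it is already running on.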

  • 2020-11-27 04:40

    rdtsc can be used to get a reliable and very precise elapsed time. On Linux you can check whether your processor supports a constant-rate TSC by looking in /proc/cpuinfo for the constant_tsc flag.

    Make sure that you stay on the same core. Every core has its own TSC, which has its own value. To use rdtsc, pin your process with taskset, SetThreadAffinityMask (Windows), or pthread_setaffinity_np so that it stays on the same core.

    Then divide the tick difference by your main clock rate, which on Linux can be found in /proc/cpuinfo, or measure the rate at runtime with this sequence:

    rdtsc
    clock_gettime
    sleep for 1 second
    clock_gettime
    rdtsc

    Then work out how many ticks elapse per second, and divide any difference in ticks by that rate to find out how much time has elapsed.
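
    The five steps above could be sketched roughly as follows, assuming GCC or Clang on x86 (for the `__rdtsc` intrinsic) and a POSIX system (for `clock_gettime` and `nanosleep`). The function names are made up for illustration, and the sleep length is a parameter rather than a fixed one second.

    ```c
    #define _POSIX_C_SOURCE 200809L
    #include <stdint.h>
    #include <time.h>
    #include <x86intrin.h>   /* __rdtsc(), GCC/Clang on x86 */

    /* Hypothetical helper: estimate TSC ticks per second by bracketing a
     * sleep with rdtsc and clock_gettime, in the order listed above. */
    static double tsc_ticks_per_second(long sleep_ms)
    {
        struct timespec t0, t1;
        struct timespec req = { sleep_ms / 1000, (sleep_ms % 1000) * 1000000L };

        uint64_t c0 = __rdtsc();
        clock_gettime(CLOCK_MONOTONIC, &t0);
        nanosleep(&req, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        uint64_t c1 = __rdtsc();

        /* Wall-clock time actually slept, in nanoseconds. */
        double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                  + (double)(t1.tv_nsec - t0.tv_nsec);

        return (double)(c1 - c0) * 1e9 / ns;
    }

    /* Convert a tick difference into seconds using the measured rate. */
    static double ticks_to_seconds(uint64_t ticks, double ticks_per_second)
    {
        return (double)ticks / ticks_per_second;
    }
    ```

    A longer sleep gives a steadier estimate, and the result is only meaningful if the thread stays on one core for the whole calibration.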

  • 2020-11-27 04:47

    When Intel first invented the TSC it measured CPU cycles. Due to various power management features "cycles per second" is not constant; so TSC was originally good for measuring the performance of code (and bad for measuring time passed).

    For better or worse; back then CPUs didn't really have too much power management, often CPUs ran at a fixed "cycles per second" anyway. Some programmers got the wrong idea and misused the TSC for measuring time and not cycles. Later (when the use of power management features became more common) these people misusing TSC to measure time whined about all the problems that their misuse caused. CPU manufacturers (starting with AMD) changed TSC so it measures time and not cycles (making it broken for measuring the performance of code, but correct for measuring time passed). This caused confusion (it was hard for software to determine what TSC actually measured), so a little later on AMD added the "TSC Invariant" flag to CPUID, so that if this flag is set programmers know that the TSC is broken (for measuring cycles) or fixed (for measuring time).

    Intel followed AMD and changed the behaviour of their TSC to also measure time, and also adopted AMD's "TSC Invariant" flag.

    This gives 4 different cases:

    • TSC measures both time and performance (cycles per second is constant)

    • TSC measures performance not time

    • TSC measures time and not performance but doesn't use the "TSC Invariant" flag to say so

    • TSC measures time and not performance and does use the "TSC Invariant" flag to say so (most modern CPUs)
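
    To tell the last case apart from the others, a program can query the flag itself. This is a minimal sketch assuming GCC or Clang on x86, where `<cpuid.h>` provides `__get_cpuid`; the flag is bit 8 of EDX in CPUID leaf 0x80000007, and the function name is made up for illustration.

    ```c
    #include <cpuid.h>   /* __get_cpuid(), GCC/Clang on x86 */

    /* Hypothetical helper: returns 1 if the CPU reports an invariant TSC
     * (CPUID leaf 0x80000007, EDX bit 8), 0 if the bit is clear or the
     * leaf is not available. */
    static int has_invariant_tsc(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* __get_cpuid returns 0 when the requested leaf is unsupported. */
        if (!__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx))
            return 0;

        return (int)((edx >> 8) & 1u);
    }
    ```

    A result of 0 does not say which of the other three cases applies, only that the CPU does not advertise the flag.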

    For cases where TSC measures time, to measure performance/cycles properly you have to use performance monitoring counters. Sadly, performance monitoring counters are different for different CPUs (model specific) and require access to MSRs (privileged code). This makes it largely impractical for applications to measure "cycles".

    Also note that if the TSC does measure time, you can't know what time scale it returns (how many nanoseconds in a "pretend cycle") without using some other time source to determine a scaling factor.

    The second problem is that for multi-CPU systems most operating systems suck. The correct way for an OS to handle the TSC is to prevent applications from using it directly (by setting the TSD flag in CR4; so that the RDTSC instruction causes an exception). This prevents various security vulnerabilities (timing side-channels). It also allows the OS to emulate the TSC and ensure it returns a correct result. For example, when an application uses the RDTSC instruction and causes an exception, the OS's exception handler can figure out a correct "global time stamp" to return.

    Of course, different CPUs each have their own TSC. This means that an application using the TSC directly gets different values on different CPUs. To help people work around the OS's failure to fix the problem (by emulating RDTSC like they should), AMD added the RDTSCP instruction, which returns the TSC and a "processor ID" (Intel ended up adopting the RDTSCP instruction too). An application running on a broken OS can use the "processor ID" to detect when it is running on a different CPU from last time; in this way (using the RDTSCP instruction) it can know when "elapsed = TSC - previous_TSC" gives an invalid result. However, the "processor ID" returned by this instruction is just a value in an MSR, and the OS has to set this value on each CPU to something different; otherwise RDTSCP will say that the "processor ID" is zero on all CPUs.

    Basically, if the CPU supports the RDTSCP instruction, and if the OS has correctly set the "processor ID" (using the MSR), then the RDTSCP instruction can help applications know when they've got a bad "elapsed time" result (but it doesn't provide any way of fixing or avoiding the bad result).
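
    A rough sketch of that check, assuming GCC or Clang on x86 (for the `__rdtscp` intrinsic), a CPU that supports the instruction, and an OS that stores a distinct value in the MSR on each CPU. The struct and function names are made up for illustration.

    ```c
    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtscp(), GCC/Clang on x86 */

    /* One reading: the counter plus the "processor ID" word from the MSR. */
    struct tsc_sample {
        uint64_t     ticks;
        unsigned int cpu_id;
    };

    /* Hypothetical helper: take one RDTSCP reading. */
    static struct tsc_sample read_tscp(void)
    {
        struct tsc_sample s;
        s.ticks = __rdtscp(&s.cpu_id);
        return s;
    }

    /* Stores end - start in *out and returns 1 when both readings carry the
     * same processor ID and the counter did not go backwards. Returns 0 when
     * the pair must be discarded; *out is left untouched in that case. */
    static int elapsed_if_same_cpu(struct tsc_sample start,
                                   struct tsc_sample end,
                                   uint64_t *out)
    {
        if (start.cpu_id != end.cpu_id)
            return 0;                    /* thread moved between readings */
        if (end.ticks < start.ticks)
            return 0;                    /* would underflow to a huge value */
        *out = end.ticks - start.ticks;
        return 1;
    }
    ```

    Wrap the code under test between two `read_tscp()` calls and repeat the measurement whenever `elapsed_if_same_cpu` returns 0. As noted above, this only detects a bad pair; it cannot repair it.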

    So; to cut a long story short, if you want an accurate performance measurement you're mostly screwed. The best you can realistically hope for is an accurate time measurement; but only in some cases (e.g. when running on a single-CPU machine or "pinned" to a specific CPU; or when using RDTSCP on OSs that set it up properly as long as you detect and discard invalid values).

    Of course, even then you'll get dodgy measurements because of things like IRQs. For this reason, it's best to run your code many times in a loop and discard any results that are much higher than the rest.
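
    One way to do the discarding, sketched here as an assumption rather than the only reasonable rule: sort the samples, treat anything above some multiple of the median as an outlier, and average what is left. The function name and the `factor` parameter are made up for illustration.

    ```c
    #include <stdint.h>
    #include <stdlib.h>

    /* qsort comparator for uint64_t values, ascending. */
    static int cmp_u64(const void *a, const void *b)
    {
        uint64_t x = *(const uint64_t *)a;
        uint64_t y = *(const uint64_t *)b;
        return (x > y) - (x < y);
    }

    /* Hypothetical helper: sorts samples in place, drops every sample larger
     * than factor * median, and returns the mean of the remaining ones.
     * Assumes n > 0 and factor >= 1.0, so at least one sample is kept. */
    static double mean_without_outliers(uint64_t *samples, size_t n, double factor)
    {
        qsort(samples, n, sizeof samples[0], cmp_u64);

        double limit = (double)samples[n / 2] * factor;
        double sum = 0.0;
        size_t kept = 0;

        /* The array is sorted, so stop at the first sample above the limit. */
        for (size_t i = 0; i < n && (double)samples[i] <= limit; i++) {
            sum += (double)samples[i];
            kept++;
        }
        return sum / (double)kept;
    }
    ```

    With the samples 100, 102, 98, 5000, 100 and a factor of 2, the 5000 reading (an interrupt, say) is dropped and the mean of the rest is 100.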

    Finally, if you really want to do it properly you should measure the overhead of measuring. To do this you'd measure how long it takes to do nothing (just the RDTSC/RDTSCP instruction alone, while discarding dodgy measurements); then subtract the overhead of measuring from the "measuring something" results. This gives you a better estimate of the time "something" actually takes.
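
    A minimal sketch of the overhead measurement, assuming GCC or Clang on x86. Taking the smallest gap seen over many trials is one simple way to discard the dodgy readings; the function names are made up for illustration. The subtraction saturates at zero, because an unsigned difference that goes below zero would otherwise wrap around to a huge value, or show up as negative if stored in a signed type.

    ```c
    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc(), GCC/Clang on x86 */

    /* Hypothetical helper: smallest gap observed between two back-to-back
     * rdtsc reads, used as an estimate of the cost of measuring itself.
     * Returns UINT64_MAX if no usable pair was seen. */
    static uint64_t rdtsc_overhead(int trials)
    {
        uint64_t best = UINT64_MAX;

        for (int i = 0; i < trials; i++) {
            uint64_t t0 = __rdtsc();
            uint64_t t1 = __rdtsc();
            if (t1 >= t0 && t1 - t0 < best)
                best = t1 - t0;
        }
        return best;
    }

    /* Subtract the overhead from a raw measurement, clamping at zero. */
    static uint64_t net_ticks(uint64_t raw, uint64_t overhead)
    {
        return raw > overhead ? raw - overhead : 0;
    }
    ```

    Measure the overhead once, on the same core the test is pinned to, and apply `net_ticks` to each raw result.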

    Note: If you can dig up a copy of Intel's System Programming Guide from when Pentium was first released (mid 1990s - not sure if it's available online anymore - I have archived copies since the 1980s) you'll find that Intel documented the time stamp counter as something that "can be used to monitor and identify the relative time of occurrence of processor events". They guaranteed that (excluding 64-bit wrap-around) it would monotonically increase (but not that it would increase at a fixed rate) and that it'd take a minimum of 10 years before it wrapped around. The latest revision of the manual documents the time stamp counter in more detail, stating that for older CPUs (P6, Pentium M, older Pentium 4) the time stamp counter "increments with every internal processor clock cycle" and that "Intel(r) SpeedStep(r) technology transitions may impact the processor clock"; and that for newer CPUs (newer Pentium 4, Core Solo, Core Duo, Core 2, Atom) the TSC increments at a constant rate (and that this is the "architectural behaviour moving forward"). Essentially, from the very beginning it was a (variable) "internal cycle counter" to be used for a time-stamp (and not a time counter to be used to track "wall clock" time), and this behaviour changed soon after the year 2000 (based on Pentium 4 release date).
