rdtsc, too many cycles

北城以北 submitted on 2019-11-30 20:09:22

I've tried your code on several Linux distros running on different Intel CPUs (admittedly all more recent than the Pentium 4 HT 630 you appear to be using). In all those tests I got values between 25 and 50 cycles.

My only hypothesis that's consistent with all the evidence is that you're running your operating system inside a virtual machine rather than on bare metal, and TSC is getting virtualized.

There are any number of reasons to get a large number:

  • The OS did a context switch, and your process got put to sleep.
  • A disk seek occurred, and your process got put to sleep.
  • …any of a slew of reasons as to why your process might get ignored.
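One common way to keep such outliers from dominating is to repeat the measurement many times and keep the smallest delta. Here is a minimal sketch of that idea. The names `tick_fn`, `min_delta` and `fake_tick` are made up for this example, and the fake counter only simulates a context switch so the effect is visible without real hardware.

```c
#include <stdint.h>

/* Hypothetical type: any function returning an increasing tick count. */
typedef uint64_t (*tick_fn)(void);

/* Take `samples` pairs of back-to-back readings and return the smallest
 * difference. An interruption inflates only the pair it lands in, so the
 * minimum approximates the undisturbed cost of one reading. */
uint64_t min_delta(tick_fn tick, int samples)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < samples; i++) {
        uint64_t start = tick();
        uint64_t end = tick();
        uint64_t delta = end - start;
        if (delta < best)
            best = delta;
    }
    return best;
}

/* Simulated counter for demonstration only: advances 30 ticks per call,
 * with an extra 2000-tick jump on the 2nd call to mimic a context switch. */
static uint64_t fake_now = 0, fake_calls = 0;

uint64_t fake_tick(void)
{
    fake_calls++;
    fake_now += (fake_calls == 2) ? 2030 : 30;
    return fake_now;
}
```

With the simulated counter, a single sample catches the jump, while a handful of samples settles on the undisturbed value. In real use you would pass your own rdtsc wrapper in place of `fake_tick`.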

Note that rdtsc is not particularly reliable for timing without extra work, because:

  • Processor speeds can change, and thus, the length of a cycle (when measured in seconds) changes.
  • Different processors may have different values for the TSC for a given instant in time.

Most operating systems have a high-precision clock or timing method: clock_gettime on Linux, for example, particularly with the monotonic clocks. (Understand, too, the difference between a wall clock and a monotonic clock: a wall clock can move backwards, even in UTC.) On Windows, I think the recommendation is QueryPerformanceCounter. Typically these clocks provide more than enough accuracy for most needs.
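For reference, a minimal sketch of reading the monotonic clock. It assumes a POSIX system where CLOCK_MONOTONIC is available; `monotonic_ns` is a name made up for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose clock_gettime under strict -std= modes */
#endif
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC reading in nanoseconds, or 0 if the call fails. */
uint64_t monotonic_ns(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return 0;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
```

To time something, call `monotonic_ns()` before and after it and subtract. Because the clock is monotonic, the difference is never negative. On older glibc versions you may need to link with -lrt.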


Also, looking at the assembly, it looks like you're only getting 32 bits of the answer: I don't see %edx getting saved after rdtsc.
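To capture the full counter, both halves need to be read: rdtsc puts the low 32 bits in EAX and the high 32 bits in EDX. This is a sketch assuming GCC or Clang inline assembly on x86 or x86-64. `read_tsc` and `combine_tsc` are names made up for this example and are not the code from the question.

```c
#include <stdint.h>

/* Combine the two 32-bit halves that rdtsc leaves in EDX (high) and EAX (low). */
uint64_t combine_tsc(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Read the full 64-bit time-stamp counter (x86/x86-64, GCC or Clang only). */
uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return combine_tsc(lo, hi);
}
```

Using explicit "=a" and "=d" outputs behaves the same in 32-bit and 64-bit builds, which is why the sketch avoids the "=A" constraint.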


Running your code, I get timings of 120-150 ns for clock_gettime using CLOCK_MONOTONIC, and 70-90 cycles for rdtsc (~20 ns at full speed, but I suspect the processor is clocked down, and that's really about 50 ns). That's on a desktop (darn SSH, forgot which machine I was on!) sitting at a roughly constant 20% CPU use. Are you sure your machine isn't bogged down?

It looks like your OS has disabled execution of RDTSC in user space, so your application has to switch to the kernel and back, which takes a lot of cycles.

This is from the Intel Software Developer’s Manual:

When in protected or virtual 8086 mode, the time stamp disable (TSD) flag in register CR4 restricts the use of the RDTSC instruction as follows. When the TSD flag is clear, the RDTSC instruction can be executed at any privilege level; when the flag is set, the instruction can only be executed at privilege level 0. (When in real-address mode, the RDTSC instruction is always enabled.)
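On Linux, one way to check this from user space is prctl with PR_GET_TSC, which reports whether the current process is allowed to execute RDTSC. The sketch below assumes Linux on x86; `rdtsc_allowed` is a name made up for this example. Other operating systems would need a different check.

```c
#include <sys/prctl.h>

/* Returns 1 if user-space RDTSC is allowed for this process,
 * 0 if it is disabled (Linux then delivers SIGSEGV on RDTSC),
 * and -1 if the query itself failed, e.g. on a non-x86 kernel. */
int rdtsc_allowed(void)
{
    int state = 0;
    if (prctl(PR_GET_TSC, &state) != 0)
        return -1;
    return state == PR_TSC_ENABLE ? 1 : 0;
}
```

A result of 1 here would suggest looking at the other explanations, such as virtualization, rather than the TSD flag.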

Edit:

To answer aix's comment, here is why TSD is most likely the reason here.

I know of only these ways for a single instruction to take longer than usual:

  1. running under an emulator,
  2. using self-modifying code,
  3. a context switch,
  4. a kernel switch.

The first two cannot usually delay execution by more than a few hundred cycles. 2000-2500 cycles is more typical of a context or kernel switch. But it is practically impossible to hit a context switch several times in the same place, so it should be a kernel switch. That means either the program is running under a debugger or RDTSC is not allowed in user mode.

The most likely reason for an OS to disable RDTSC is security: there have been attempts to use RDTSC for timing attacks against encryption programs.

Perig

Instruction cache miss? (this is my guess)

Also, possibly,

Switch to hypervisor in a virtualized system? Remnants of program bootstrap (including network activity on same CPU)?

To Thanatos: On systems more recent than 2008, the TSC read by rdtsc behaves like a wall clock: it ticks at a constant rate and does not vary with frequency steps.
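On Linux, whether a particular machine has such a constant-rate TSC shows up as the constant_tsc and nonstop_tsc entries in the flags line of /proc/cpuinfo. Here is a small sketch of a whole-word flag check. `has_cpu_flag` is a name made up for this example, and reading the flags line from the file is left to the caller.

```c
#include <string.h>

/* Return 1 if `flag` appears as a whole whitespace-separated word in
 * `flags`, else 0. Substring hits such as "tsc" inside "constant_tsc"
 * are rejected. */
int has_cpu_flag(const char *flags, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = flags;

    if (len == 0)
        return 0;
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';
        if (starts && ends)
            return 1;
        p += len;   /* skip this partial match and keep searching */
    }
    return 0;
}
```

If both flags are present, cycle counts convert to time at a fixed ratio regardless of frequency scaling.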

Can you try this little code?

#include <stdio.h>

// tick() is the rdtsc wrapper from your question.

int main()
{
    long long res = 0;

    fflush(stdout);           // change the exact timing of stdout, in case there is something to write in an ssh connection, together with its interrupts

    for (int pass = 0; pass < 2; pass++)
    {
        res = tick();
        res = tick() - res;
    }
    printf("%lld", res);      // the first pass's result is overwritten; only the second pass's result is displayed
    return 0;
}

Just an idea - maybe these two rdtsc instructions are executed on different cores? rdtsc values may slightly vary across cores.
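One way to test this idea is rdtscp, which returns the counter together with the IA32_TSC_AUX value in a single instruction. The sketch below assumes an x86 CPU that supports rdtscp, a GCC or Clang toolchain providing `__rdtscp` in x86intrin.h, and an OS that stores a per-CPU identifier in IA32_TSC_AUX (Linux does, as far as I know). `tsc_sample`, `read_tscp` and `same_cpu` are names made up for this example.

```c
#include <stdint.h>
#include <x86intrin.h>

/* One timestamp plus the IA32_TSC_AUX value read together with it. */
struct tsc_sample {
    uint64_t ticks;
    uint32_t aux;   /* OS-defined; assumed here to identify the CPU */
};

/* Read the TSC and the auxiliary value with a single rdtscp instruction. */
struct tsc_sample read_tscp(void)
{
    struct tsc_sample s;
    unsigned int aux = 0;
    s.ticks = __rdtscp(&aux);
    s.aux = aux;
    return s;
}

/* A delta is only meaningful if both samples came from the same CPU. */
int same_cpu(struct tsc_sample a, struct tsc_sample b)
{
    return a.aux == b.aux;
}
```

Take one sample before and one after the code being timed, and discard the measurement when `same_cpu` returns 0. Pinning the process to a single core is the other usual way to rule this out.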
