What's up with the “half fence” behavior of rdtscp?

问题

For many years x86 CPUs supported the rdtsc instruction, which reads the "time stamp counter" of the current CPU. The exact definition of this counter has changed over time, but on recent CPUs it is a counter that increments at a fixed frequency with respect to wall clock time, so it is very useful as building block for a fast, accurate clock or measuring the time taken by small segments of code.

One important fact about the rdtsc instruction isn't ordered in any special way with the surrounding code. Like most instructions, it can be freely reordered with respect to other instructions which it isn't in a dependency relationship with. This is actually "normal" and for most instructions it's just a mostly invisible way of making the CPU faster (it's just a long winder way of saying out-of-order execution).

For rdtsc it is important because it means you might not be timing the code you expect to be timing. For example, given the following sequence¹:

rdtsc
mov ecx, eax
mov rdi, [rdi]
mov rdi, [rdi]
rdtsc

You might expect rdtsc to measure the latency of the two pointer chasing loading loads mov rdi, [rdi]. In practice, however, even if both of these loads take a look time (100s of cycles if they miss in the cache), you'll get a fairly small reading for the rdtsc pair. The problem is that the second rdtsc doens't wait for the loads to finish, it just executes out of order, so you aren't timing the interval you think you are. Perhaps both rdtsc instruction actually even execute before the first load even starts, depending how rdi was calculated in the code prior to this example.

So far, this is sounding more like an answer to a question nobody asked than a real question, but I'm getting there.

You have two basic use-cases for rdtsc:

As a quick timestamp, in which can you usually don't care exactly how it reorders with the surrounding code, since you probably don't have have an instruction-level concept of where the timestamp should be taken, anyways.
As a precise timing mechanism, e.g., in a micro-benchmark. In this case you'll usually protect your rdtsc from re-ordering with the lfence instruction. For the example above, you might do something like:
```
lfence
rdtsc
lfence
mov ecx, eax
...
lfence
rdtsc
```
To ensure the timed instructions (...) don't escape outside of the timed region, and also to ensure instructions from inside the time region don't come in (probably less of a problem, but they may compete for resources with the code you want to measure).

Years later, Intel looked down upon us poor programmers and came up with a new instruction: rdtscp. Like rdtsc it returns a reading of the time stamp counter, and this guy does something more: it reads a core-specific MSR value atomically with the timestamp reading. On most OSes this contains a core ID value. I think the idea is that this value can be used to properly adjust the returned value to real time on CPUs that may have different TSC offsets per core.

Great.

The other thing rdtscp introduced was half-fencing in terms of out-of-order execution:

From the manual:

The RDTSCP instruction is not a serializing instruction, but it does wait until all previous instructions have executed and all previous loads are globally visible.1 But it does not wait for previous stores to be globally visible, and subsequent instructions may begin execution before the read operation is performed.

So it's like putting an lfence before the rdtscp, but not after. What is the point of this half-fencing behavior? If you want a general timestamp and don't care about instruction ordering, the unfenced behavior is what you want. If you want to use this for timing short code sections, the half-fencing behavior is useful only for the second (final) reading, but not for the initial reading, since the fence is on the "wrong" side (in practice you want fences on both sides, but having them on the inside is probably the most important).

What purpose does such half-fencing serve?

¹ I'm ignoring the upper 32-bits of the counter in this case.

来源：https://stackoverflow.com/questions/52158572/whats-up-with-the-half-fence-behavior-of-rdtscp

标签

performance

assembly

x86

microbenchmark

rdtsc