Is mfence for rdtsc necessary on x86_64 platform?


无人共我 2021-01-07 01:38

unsigned int lo = 0;
unsigned int hi = 0;
__asm__ __volatile__ (
    "mfence;rdtsc" : "=a"(lo), "=d"(hi) : : "memory"
);


2 answers

被撕碎了的回忆 (original poster)

2021-01-07 02:10
What you need to perform a sensible measurement with rdtsc is a serializing instruction.

As is well known, many people put cpuid before rdtsc.
rdtsc needs to be serialized from above and below (read: all instructions before it must be retired, and it must itself be retired before the test code starts).

Unfortunately, the second condition is often neglected because cpuid is a very bad choice for it: cpuid overwrites eax, ebx, ecx and edx, so placed after rdtsc it clobbers the result in edx:eax.
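As a rough sketch of the "cpuid before rdtsc" pattern, assuming GCC or Clang extended inline asm on x86_64 (the function name `tsc_after_cpuid` is made up for illustration), the serialization from above could look like this. Note the clobber list: because cpuid writes all four registers, ebx and ecx must be declared as clobbered, and the same property is what makes cpuid unusable right after rdtsc.

```c
#include <stdint.h>

/* Hypothetical helper: serialize with cpuid, then read the TSC.
 * Assumes GCC/Clang inline asm on x86_64. */
static inline uint64_t tsc_after_cpuid(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ (
        "xorl %%eax, %%eax\n\t"  /* select CPUID leaf 0 */
        "cpuid\n\t"              /* serializing; overwrites eax, ebx, ecx, edx */
        "rdtsc"                  /* counter is returned in edx:eax */
        : "=a"(lo), "=d"(hi)
        :
        : "rbx", "rcx", "cc", "memory"
    );
    return ((uint64_t)hi << 32) | lo;
}
```

This only covers the first condition (everything before rdtsc has retired); nothing here stops later instructions from starting before rdtsc completes.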
When looking for alternatives people think that instructions that have a "fence" in their names will do, but this is also untrue. Straight from Intel:

MFENCE does not serialize the instruction stream.

An instruction that is almost serializing and will do in any measurement where previous stores don't need to complete is lfence.

Simply put, lfence makes sure that no new instructions start before every prior instruction has completed locally. See this answer of mine for a more detailed explanation of locality.
It also doesn't drain the store buffer like mfence does, and it doesn't clobber registers like cpuid does.

So lfence / rdtsc / lfence is a better-crafted sequence of instructions than mfence / rdtsc, where mfence is pretty much useless unless you explicitly want the previous stores to be completed before the test begins/ends (but not before rdtsc is executed!).
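A minimal sketch of the lfence / rdtsc / lfence sequence, again assuming GCC or Clang inline asm on x86_64. The names `tsc_fenced` and `cycles_for` are hypothetical helpers, not part of any library.

```c
#include <stdint.h>

/* Hypothetical helper: read the TSC with an lfence on each side.
 * The first lfence waits for prior instructions to complete locally,
 * the second keeps later instructions from starting before rdtsc is done. */
static inline uint64_t tsc_fenced(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ (
        "lfence\n\t"
        "rdtsc\n\t"
        "lfence"
        : "=a"(lo), "=d"(hi)   /* rdtsc returns the counter in edx:eax */
        :
        : "memory"             /* also a compiler-level barrier */
    );
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: TSC ticks spent in fn(), including fence overhead. */
static inline uint64_t cycles_for(void (*fn)(void))
{
    uint64_t start = tsc_fenced();
    fn();
    uint64_t end = tsc_fenced();
    return end - start;
}
```

Unlike the mfence version in the question, this does not wait for earlier stores to become globally visible; if that is what you want to measure, an mfence would have to be added explicitly.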

If your test to detect reordering is assert(t2 > t1), then I believe you will test nothing.
Leaving aside the return and the call, which may or may not prevent the CPU from seeing the second rdtsc in time for a reorder, it is unlikely (though possible!) that the CPU will reorder two rdtsc instructions even if one comes right after the other.

Imagine we have a rdtsc2 that is exactly like rdtsc but writes ecx:ebx¹.

If we execute
```
rdtsc
rdtsc2
```
it is highly likely that ecx:ebx > edx:eax, because the CPU has no reason to execute rdtsc2 before rdtsc.
Reordering doesn't mean random ordering; it means looking for another instruction when the current one cannot be executed.
But rdtsc has no dependency on any previous instruction, so it's unlikely to be delayed when encountered by the OoO core.
However, peculiar internal micro-architectural details may invalidate my thesis, hence the word "likely" in my previous statement.
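To make the point concrete, here is a small sketch of such a back-to-back check, assuming GCC or Clang inline asm on x86_64 (`tsc_plain` and `back_to_back_delta` are hypothetical names). It deliberately uses no fences at all.

```c
#include <stdint.h>

/* Hypothetical helper: plain rdtsc, no fence of any kind. */
static inline uint64_t tsc_plain(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Two reads in a row; returns t2 - t1 as a signed value so that a
 * (very unlikely) inversion would show up as a negative number. */
static inline int64_t back_to_back_delta(void)
{
    uint64_t t1 = tsc_plain();
    uint64_t t2 = tsc_plain();
    return (int64_t)(t2 - t1);
}
```

A non-negative result is the expected outcome even without any fence, so it says nothing about whether rdtsc was reordered with respect to the code you actually want to time.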

¹ We don't actually need this altered instruction, since register renaming achieves the same effect, but in case you are not familiar with renaming, the thought experiment helps.