Using time stamp counter and clock_gettime for cache miss

Submitted by 情到浓时终转凉 on 2019-11-28 14:39:00

You broke Hadi's code by removing the read of tmp at the end, so it gets optimized away by gcc. There is no load in your timed region. C statements are not asm instructions.

Look at the compiler-generated asm, e.g. on the Godbolt compiler explorer. You should always be doing this when you're trying to microbenchmark really low-level stuff like this, especially if your timing results are unexpected.

    lfence
    clflush [rcx]
    lfence

    lfence
    rdtsc                     # start of first timed region
    lfence
       # nothing because tmp=array[0] optimized away.
    lfence
    mov     rcx, rax
    sal     rdx, 32
    or      rcx, rdx
    rdtsc                     # end of first timed region
    mov     edi, OFFSET FLAT:.LC2
    lfence

    sal     rdx, 32
    or      rax, rdx
    sub     rax, rcx
    mov     rsi, rax
    mov     rbx, rax
    xor     eax, eax
    call    printf

You get a compiler warning about an unused variable from -Wall, but you can silence that in ways that still optimize away. e.g. your tmp++ doesn't make tmp available to anything outside the function, so it still optimizes away. Silencing the warning is not sufficient: print the value, return the value, or assign it to a volatile variable outside the timed region. (Or use inline asm volatile to require the compiler to have it in a register at some point. Chandler Carruth's CppCon2015 talk about using perf mentions some tricks: https://www.youtube.com/watch?v=nXaxk27zwlk)


In GNU C (at least with gcc and clang -O3), you can force a read by casting to (volatile int*), like this:

    // int tmp = array[0];           // replace this
    (void) *(volatile int*)array;    // with this

The (void) is to avoid a warning for evaluating an expression in a void context, like writing x;.

This kind of looks like strict-aliasing UB, but my understanding is that gcc defines this behaviour. The Linux kernel casts a pointer to add a volatile qualifier in its ACCESS_ONCE macro, so it's used in one of the codebases that gcc definitely cares about supporting. You could always make the whole array volatile; it doesn't matter if initialization of it can't auto-vectorize.

Anyway, this compiles to

    # gcc8.2 -O3
    lfence
    rdtsc
    lfence
    mov     rcx, rax
    sal     rdx, 32
    mov     eax, DWORD PTR [rsp]    # the load which wasn't there before.
    lfence
    or      rcx, rdx
    rdtsc
    mov     edi, OFFSET FLAT:.LC2
    lfence

Then you don't have to mess around with making sure tmp is used, or with worrying about dead-store elimination, CSE, or constant-propagation. In practice the _mm_mfence() or something else in Hadi's original answer included enough memory-barriering to make gcc actually redo the load for the cache-miss + cache-hit case, but it easily could have optimized away one of the reloads.


Note that this can result in asm that loads into a register but never reads it. Current CPUs do still wait for the result (especially if there's an lfence), but overwriting the result could let a hypothetical CPU discard the load and not wait for it. (It's up to the compiler whether it happens to do something else with the register before the next lfence, like mov part of the rdtsc result there.)

This is tricky / unlikely for hardware to do, because the CPU has to be ready for exceptions (see discussion in comments). RDRAND reportedly does work that way (What is the latency and throughput of the RDRAND instruction on Ivy Bridge?), but that's probably a special case.

I tested this myself on Skylake by adding an xor eax,eax to the compiler's asm output, right after the mov eax, DWORD PTR [rsp], to kill the result of the cache-miss load. That didn't affect the timing.

Still, this is a potential gotcha with discarding the results of a volatile load; future CPUs might behave differently. It might be better to sum the load results (outside the timed region) and assign them at the end to a volatile int sink, in case future CPUs start discarding uops that produce unread results. But still use volatile for the loads to make sure they happen where you want them.


Also don't forget to do some kind of warm-up loop to get the CPU up to max speed, unless you want to measure the cache-miss execution time at idle clock speed. It looks like your empty timed region is taking a lot of reference cycles, so your CPU was probably clocked down pretty slow.


So how exactly do cache attacks, e.g. Meltdown and Spectre, overcome this issue? Do they basically have to disable the HW prefetcher, since they measure adjacent addresses to find out whether they hit or miss?

The cache-read side-channel as part of a Meltdown or Spectre attack typically uses a stride large enough that HW prefetching can't detect the access pattern. e.g. on separate pages instead of contiguous lines. One of the first google hits for meltdown cache read prefetch stride was https://medium.com/@mattklein123/meltdown-spectre-explained-6bc8634cc0c2, which uses a stride of 4096. It could be tougher for Spectre, because your stride is at the mercy of the "gadgets" you can find in the target process.
