Why is gcc allowed to speculatively load from a struct?


Example Showing the gcc Optimization and User Code that May Fault

The function 'foo' in the snippet below will load only one of the struct members A or B; well, at least that is what the unoptimized code does.
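
The snippet itself is cut off in this copy of the question; a reconstruction consistent with the answer below (the exact types and signature are assumptions) is:

    typedef struct {
      int A, B;
    } Pair;

    int foo(const Pair *P, int c) {
      int x;
      if (c)
        x = P->A;
      else
        x = P->B;
      return c + x;
    }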

6 Answers
  •  有刺的猬
    2021-02-06 21:09

    The -> operator on a Pair * implies that there's a whole Pair object fully allocated. (@Hurkyl quotes the standard.)

    x86 (like any normal architecture) doesn't have side-effects for accessing normal allocated memory, so x86 memory semantics are compatible with the C abstract machine's semantics for non-volatile memory. Compilers can speculatively load if/when they think that will be a performance win on the target microarchitecture they're tuning for in any given situation.
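
    For contrast, here is a sketch of mine (not from the original question) where speculation is forbidden: volatile accesses are observable side effects in the C abstract machine, so the compiler must perform exactly the loads the source asks for.

    int load_volatile(const volatile int *p, int c) {
      // Each volatile read is an observable side effect: the compiler
      // must emit exactly one of these two loads, never both.
      return c ? p[0] : p[1];
    }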

    Note that on x86 memory protection operates with page granularity. The compiler could unroll a loop or vectorize with SIMD in a way that reads outside an object, as long as all pages touched contain some bytes of the object. (See: Is it safe to read past the end of a buffer within the same page on x86 and x64?) libc strlen() implementations hand-written in assembly do this, but AFAIK gcc doesn't, instead using scalar loops for the leftover elements at the end of an auto-vectorized loop, even where it already aligned the pointers with a (fully unrolled) startup loop. (Perhaps because it would make runtime bounds-checking with valgrind difficult.)
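
    To make the page-granularity argument concrete, here's a sketch (mine, assuming x86's 4096-byte pages): a 16-byte vector load from p can't fault if it doesn't cross into the next page, even when it reads past the end of the object.

    #include <stdint.h>

    // Non-zero if bytes p .. p+15 all lie within one 4096-byte page,
    // i.e. a 16-byte load from p can't touch a possibly-unmapped page.
    static int load16_stays_in_page(const void *p) {
      return ((uintptr_t)p & 4095) <= 4096 - 16;
    }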


    To get the behaviour you were expecting, use a const int * arg.

    An array is a single object, but pointers are different from arrays. (Even with inlining into a context where both array elements are known to be accessible, I wasn't able to get gcc to emit code like it does for the struct, so if its struct code is a win, it's a missed optimization not to do it on arrays when it's also safe; see the sketch below.)
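
    Such a context (my reconstruction; the exact test isn't shown here) could look like this:

    // Both arr[0] and arr[1] are provably accessible after inlining,
    // yet gcc7.2 still branches instead of loading branchlessly.
    static inline int select_elem(const int *p, int c) {
      int x = c ? p[0] : p[1];
      return c + x;
    }

    int both_in_bounds(int c) {
      int arr[2] = {1, 2};
      return select_elem(arr, c);
    }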

    In C, you're allowed to pass this function a pointer to a single int, as long as c is non-zero. When compiling for x86, gcc has to assume that it could be pointing to the last int in a page, with the following page unmapped.

    Source + gcc and clang output for this and other variations on the Godbolt compiler explorer

    // exactly equivalent to  const int p[2]
    int load_pointer(const int *p, int c) {
      int x;
      if (c)
        x = p[0];
      else
        x = p[1];  // gcc missed optimization: still does an add with c known to be zero
      return c + x;
    }
    
    load_pointer:    # gcc7.2 -O3
        test    esi, esi
        jne     .L9
        mov     eax, DWORD PTR [rdi+4]
        add     eax, esi         # missed optimization: esi=0 here so this is a no-op
        ret
    .L9:
        mov     eax, DWORD PTR [rdi]
        add     eax, esi
        ret
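
    For example, this caller (a sketch) is perfectly legal, and x could happen to sit in the last 4 bytes of a mapped page:

    int one_int_is_fine(void) {
      int x = 42;
      return load_pointer(&x, 1);  // c != 0, so only p[0] is accessed
    }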
    

    In C, you can sort of pass an array object (by reference) to a function, guaranteeing to the function that it's allowed to touch all of that memory even if the C abstract machine doesn't actually access it. The syntax is int p[static 2]:

    int load_array(const int p[static 2], int c) {
      ... // same body
    }
    

    But gcc doesn't take advantage, and emits identical code to load_pointer.
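
    Semantically, though, [static 2] does change what callers may do (a usage sketch):

    int static2_demo(void) {
      int arr[2] = {10, 20};
      int ok = load_array(arr, 0);  // fine: two elements really exist

      int x = 1;
      // load_array(&x, 1);  // undefined behaviour: the [static 2]
      //                     // promise of 2 accessible elements is broken
      return ok + x;
    }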


    Off topic: clang compiles all versions (struct and array) the same way, using a cmov to branchlessly compute a load address.

        lea     rax, [rdi + 4]
        test    esi, esi
        cmovne  rax, rdi
        add     esi, dword ptr [rax]
        mov     eax, esi            # missed optimization: mov on the critical path
        ret
    

    This isn't necessarily good: it has higher latency than gcc's struct code, because the load address is dependent on a couple of extra ALU uops. But it is pretty good when it isn't safe to read both addresses (so the speculative-load trick is unavailable) and a branch would predict poorly.

    We can get better code for the same strategy from gcc and clang, using setcc (1 uop with 1c latency on all CPUs except some really ancient ones), instead of cmovcc (2 uops on Intel before Skylake). xor-zeroing is always cheaper than an LEA, too.

    int load_pointer_v3(const int *p, int c) {
      int offset = (c==0);
      int x = p[offset];
      return c + x;
    }
    
        xor     eax, eax
        test    esi, esi
        sete    al
        add     esi, dword ptr [rdi + 4*rax]
        mov     eax, esi
        ret
    

    gcc and clang both put the final mov on the critical path. And on Intel Sandybridge-family, the indexed addressing mode doesn't stay micro-fused with the add. So this would be better, like what gcc does in the branching version:

        xor     eax, eax
        test    esi, esi
        sete    al
        mov     eax, dword ptr [rdi + 4*rax]
        add     eax, esi
        ret
    

    Simple addressing modes like [rdi] or [rdi+4] have 1c lower latency than others on Intel SnB-family CPUs, so this might actually be worse latency on Skylake (where cmov is cheap). The test and lea can run in parallel.

    After inlining, that final mov probably wouldn't exist, and it could just add into esi.
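
    For instance, in a hypothetical caller like this (my sketch), the add could target the register holding acc directly:

    int accumulate(const int *p, int c, int acc) {
      // After inlining load_pointer_v3, no separate mov is needed:
      // the compiler can add the loaded value straight into acc.
      return acc + load_pointer_v3(p, c);
    }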
