"The function 'foo' in the snippet below will load only one of the struct members A or B; well at le..."

A -> operator on a Pair * implies that there's a whole Pair object fully allocated. (@Hurkyl quotes the standard.)
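For context, the kind of code being discussed is roughly the following (a reconstruction from the description above, not the question's exact snippet; the member and function names are taken from it):

struct Pair { int A, B; };

// Roughly the shape of 'foo': the abstract machine reads only one member,
// but because a valid Pair * implies a whole Pair object exists, a compiler
// is allowed to load both members and select between them.
int foo(const struct Pair *P, int c) {
    return c ? P->A : P->B;
}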
x86 (like any normal architecture) doesn't have side-effects for accessing normal allocated memory, so x86 memory semantics are compatible with the C abstract machine's semantics for non-volatile memory. Compilers can speculatively load if/when they think it will be a performance win on the target microarchitecture they're tuning for in any given situation.
Note that on x86, memory protection operates with page granularity. The compiler could unroll a loop or vectorize with SIMD in a way that reads outside an object, as long as all pages touched contain some bytes of the object. (See: Is it safe to read past the end of a buffer within the same page on x86 and x64?) libc strlen() implementations hand-written in assembly do this, but AFAIK gcc doesn't: it uses scalar loops for the leftover elements at the end of an auto-vectorized loop, even where it has already aligned the pointers with a (fully unrolled) startup loop. (Perhaps because that would make runtime bounds-checking with valgrind difficult.)
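To illustrate the page-granularity idea (a hand-written sketch, not glibc's or gcc's actual code; the over-read is technically out of bounds at the C level, which is part of why real implementations do it in asm, and __builtin_ctz is a gcc/clang builtin): an SSE2-style strlen reads 16 bytes at a time from 16-byte-aligned addresses, so it may read past the terminator but never into another page.

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t strlen_sse2_sketch(const char *s) {
    const char *p = s;
    /* Scalar until p is 16-byte aligned: these reads stay inside the string. */
    while (((uintptr_t)p & 15) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }
    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        /* Aligned 16-byte load: may read past the terminator, but never past
           the end of the current 16-byte chunk, hence never into another page. */
        __m128i v = _mm_load_si128((const __m128i *)p);
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask != 0)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
        p += 16;
    }
}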
To get the behaviour you were expecting, use a const int * arg.
An array is a single object, but pointers are different from arrays. (Even with inlining into a context where both array elements are known to be accessible, I wasn't able to get gcc to emit code like it does for the struct, so if its struct code is a win, it's a missed optimization not to do it on arrays when it's also safe.)
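The kind of inlining context meant is something like this (a hypothetical sketch, not necessarily the exact code that was tried):

static inline int load_elem(const int *p, int c) {
    int x = c ? p[0] : p[1];
    return c + x;
}

int caller(int c) {
    int arr[2] = {10, 20};   // both elements are definitely accessible here
    // Loading both elements and selecting would be safe after inlining, but
    // (as noted above) gcc wasn't observed emitting struct-style branchless
    // code for arrays.
    return load_elem(arr, c);
}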
In C, you're allowed to pass this const int * function a pointer to a single int, as long as c is non-zero. When compiling for x86, gcc has to assume that it could be pointing to the last int in a page, with the following page unmapped.
Source + gcc and clang output for this and other variations on the Godbolt compiler explorer
// exactly equivalent to const int p[2]
int load_pointer(const int *p, int c) {
int x;
if (c)
x = p[0];
else
x = p[1]; // gcc missed optimization: still does an add with c known to be zero
return c + x;
}
load_pointer: # gcc7.2 -O3
test esi, esi
jne .L9
mov eax, DWORD PTR [rdi+4]
add eax, esi # missed optimization: esi=0 here so this is a no-op
ret
.L9:
mov eax, DWORD PTR [rdi]
add eax, esi
ret
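Concretely, a call like this is legal (a hypothetical caller, not part of the original example):

// Hypothetical caller: a pointer to a single int is fine, because with
// c non-zero the abstract machine only ever reads p[0].
int call_with_one_int(void) {
    int one = 42;
    return load_pointer(&one, 1);
}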
In C, you can sort of pass an array object (by reference) to a function, guaranteeing to the function that it's allowed to touch all of that memory even if the C abstract machine doesn't. The syntax is int p[static 2]:
int load_array(const int p[static 2], int c) {
... // same body
}
But gcc doesn't take advantage, and emits identical code to load_pointer.
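For example (hypothetical callers, not from the original), this is what the [static 2] contract means:

// Hypothetical callers showing what the [static 2] qualifier promises:
int use_load_array(void) {
    int two[2] = {1, 2};
    int ok = load_array(two, 0);    // fine: at least 2 ints are accessible
    int one = 1;
    // load_array(&one, 1);         // undefined behaviour: fewer than 2 elements
    return ok;
}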
Off topic: clang compiles all versions (struct and array) the same way, using a cmov to branchlessly compute a load address:
lea rax, [rdi + 4]
test esi, esi
cmovne rax, rdi
add esi, dword ptr [rax]
mov eax, esi # missed optimization: mov on the critical path
ret
This isn't necessarily good: it has higher latency than gcc's struct code, because the load address is dependent on a couple extra ALU uops. It is pretty good if both addresses aren't safe to read and a branch would predict poorly.
We can get better code for the same strategy from gcc and clang, using setcc (1 uop with 1c latency on all CPUs except some really ancient ones) instead of cmovcc (2 uops on Intel before Skylake). xor-zeroing is always cheaper than an LEA, too.
int load_pointer_v3(const int *p, int c) {
int offset = (c==0);
int x = p[offset];
return c + x;
}
xor eax, eax
test esi, esi
sete al
add esi, dword ptr [rdi + 4*rax]
mov eax, esi
ret
gcc and clang both put the final mov on the critical path. And on Intel Sandybridge-family, the indexed addressing mode doesn't stay micro-fused with the add. So this would be better, like what gcc does in its branching version:
xor eax, eax
test esi, esi
sete al
mov eax, dword ptr [rdi + 4*rax]
add eax, esi
ret
Simple addressing modes like [rdi] or [rdi+4] have 1c lower latency than other addressing modes on Intel SnB-family CPUs, so this might actually be worse latency on Skylake (where cmov is cheap). (In the cmov version, the test and lea can run in parallel.)
After inlining, that final mov probably wouldn't exist, and the compiler could just add into esi.