I first noticed in 2009 that GCC (at least on my projects and on my machines) has the tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.
I'm by no means an expert in this area, but I seem to remember that modern processors are quite sensitive when it comes to branch prediction. The algorithms used to predict the branches are (or at least were back in the days I wrote assembler code) based on several properties of the code, including the distance to the target and the direction of the jump.
The scenario which comes to mind is small loops. When a branch went backwards and the distance was not too far, the branch prediction was optimized for this case, as all small loops are written this way. The same rules might come into play when you swap the locations of add and work in the generated code, or when the positions of both change slightly.
That said, I have no idea how to verify that and I just wanted to let you know that this might be something you want to look into.
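If someone wanted to probe this hypothesis, one rough approach (just a sketch, assuming Linux with perf available and the two-file example that appears later in this thread) is to compare the branch statistics of the two builds:

g++ -O2 add.cpp main.cpp -o test-O2    # build the -O2 variant
g++ -Os add.cpp main.cpp -o test-Os    # build the -Os variant
perf stat -e branches,branch-misses ./test-O2 0 0
perf stat -e branches,branch-misses ./test-Os 0 0

If the branch-miss counts are essentially identical between the two, branch prediction is probably not the explanation.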
I'm adding this post-accept to point out that the effects of alignment on overall performance of programs - including big ones - have been studied. For example, this article (and I believe a version of this also appeared in CACM) shows how link order and OS environment size changes alone were sufficient to shift performance significantly. They attribute this to alignment of "hot loops".
This paper, titled "Producing wrong data without doing anything obviously wrong!", says that inadvertent experimental bias due to nearly uncontrollable differences in program running environments probably renders many benchmark results meaningless.
I think you're encountering a different angle on the same observation.
For performance-critical code, this is a pretty good argument for systems that assess the environment at installation or run time and choose the local best among differently optimized versions of key routines.
I think that you can obtain the same result as what you did:

    "I grabbed the assembly for -O2 and merged all its differences into the assembly for -Os except the .p2align lines"

… by using -O2 -falign-functions=1 -falign-jumps=1 -falign-loops=1 -falign-labels=1.
I have been compiling everything with these options, which were faster than plain -O2 every time I bothered to measure, for 15 years.
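For concreteness, a full invocation looks something like this (a sketch; test.cpp is just a placeholder file name):

# -O2 with all alignment padding disabled
g++ -O2 -falign-functions=1 -falign-jumps=1 -falign-loops=1 -falign-labels=1 test.cpp -o test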
Also, in a completely different context (including a different compiler), I noticed that the situation is similar: the option that is supposed to "optimize code size rather than speed" optimizes for code size and speed.
If I guess correctly, this padding is for stack alignment.
No, this has nothing to do with the stack; the NOPs that are generated by default, and that the -falign-*=1 options prevent, are for code alignment.
According to Why does GCC pad functions with NOPs?, it is done in the hope that the code will run faster, but apparently this optimization backfired in my case.
Is it the padding that is the culprit in this case? Why and how?
It is very likely that the padding is the culprit. The reason padding is felt to be necessary and is useful in some cases is that code is typically fetched in lines of 16 bytes (see Agner Fog's optimization resources for the details, which vary by model of processor). Aligning a function, loop, or label on a 16-byte boundary statistically increases the chances that one fewer line will be necessary to contain the function or loop. Obviously, it backfires because these NOPs reduce code density and therefore cache efficiency. In the case of loops and labels, the NOPs may even need to be executed once (when execution arrives at the loop/label normally, as opposed to from a jump).
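To see where GCC intends to insert such padding (a sketch; main.cpp is a placeholder name), you can compile to assembly and look for the alignment directives:

g++ -O2 -S main.cpp -o main.s
grep -n '\.p2align' main.s    # each .p2align directive marks a spot that may be padded with NOPs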
If your program is bounded by the L1 instruction cache, then optimizing for size suddenly starts to pay off. When I last checked, the compiler was not smart enough to figure this out in all cases. In your case, -O3 probably generates enough code for two cache lines, but -Os fits in one cache line.
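A crude way to compare how much code the hot function occupies under each option (a sketch; the file name and the mangled symbol follow the example discussed in this thread) is to look at the symbol sizes:

g++ -O2 -c main.cpp -o main-O2.o
g++ -Os -c main.cpp -o main-Os.o
nm --print-size --size-sort main-O2.o    # per-function sizes, e.g. for _ZL4workii
nm --print-size --size-sort main-Os.o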
By default, compilers optimize for an "average" processor. Since different processors favor different instruction sequences, compiler optimizations enabled by -O2 might benefit the average processor but decrease performance on your particular processor (and the same applies to -Os). If you try the same example on different processors, you will find that some of them benefit from -O2 while others are more favorable to -Os optimizations.
Here are the results for time ./test 0 0 on several processors (user time reported):

Processor (System-on-Chip)             Compiler    Time (-O2)  Time (-Os)  Fastest
AMD Opteron 8350                       gcc-4.8.1   0.704s      0.896s      -O2
AMD FX-6300                            gcc-4.8.1   0.392s      0.340s      -Os
AMD E2-1800                            gcc-4.7.2   0.740s      0.832s      -O2
Intel Xeon E5405                       gcc-4.8.1   0.603s      0.804s      -O2
Intel Xeon E5-2603                     gcc-4.4.7   1.121s      1.122s      -
Intel Core i3-3217U                    gcc-4.6.4   0.709s      0.709s      -
Intel Core i3-3217U                    gcc-4.7.3   0.708s      0.822s      -O2
Intel Core i3-3217U                    gcc-4.8.1   0.708s      0.944s      -O2
Intel Core i7-4770K                    gcc-4.8.1   0.296s      0.288s      -Os
Intel Atom 330                         gcc-4.8.1   2.003s      2.007s      -O2
ARM 1176JZF-S (Broadcom BCM2835)       gcc-4.6.3   3.470s      3.480s      -O2
ARM Cortex-A8 (TI OMAP DM3730)         gcc-4.6.3   2.727s      2.727s      -
ARM Cortex-A9 (TI OMAP 4460)           gcc-4.6.3   1.648s      1.648s      -
ARM Cortex-A9 (Samsung Exynos 4412)    gcc-4.6.3   1.250s      1.250s      -
ARM Cortex-A15 (Samsung Exynos 5250)   gcc-4.7.2   0.700s      0.700s      -
Qualcomm Snapdragon APQ8060A           gcc-4.8     1.53s       1.52s       -Os
In some cases you can alleviate the effect of disadvantageous optimizations by asking gcc to optimize for your particular processor (using the options -mtune=native or -march=native):
Processor            Compiler    Time (-O2 -mtune=native)  Time (-Os -mtune=native)
AMD FX-6300          gcc-4.8.1   0.340s                    0.340s
AMD E2-1800          gcc-4.7.2   0.740s                    0.832s
Intel Xeon E5405     gcc-4.8.1   0.603s                    0.803s
Intel Core i7-4770K  gcc-4.8.1   0.296s                    0.288s
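If you want to try this on your own machine, the invocation is simply the build used above plus the tuning flag, along these lines (a sketch; the file names follow the example later in this thread):

g++ -O2 -mtune=native add.cpp main.cpp -o test
time ./test 0 0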
Update: on an Ivy Bridge-based Core i3, three versions of gcc (4.6.4, 4.7.3, and 4.8.1) produce binaries with significantly different performance, but the assembly code has only subtle variations. So far, I have no explanation of this fact.
Assembly from gcc-4.6.4 -Os (executes in 0.709 secs):
00000000004004d2 <_ZL3addRKiS0_.isra.0>:
4004d2: 8d 04 37 lea eax,[rdi+rsi*1]
4004d5: c3 ret
00000000004004d6 <_ZL4workii>:
4004d6: 41 55 push r13
4004d8: 41 89 fd mov r13d,edi
4004db: 41 54 push r12
4004dd: 41 89 f4 mov r12d,esi
4004e0: 55 push rbp
4004e1: bd 00 c2 eb 0b mov ebp,0xbebc200
4004e6: 53 push rbx
4004e7: 31 db xor ebx,ebx
4004e9: 41 8d 34 1c lea esi,[r12+rbx*1]
4004ed: 41 8d 7c 1d 00 lea edi,[r13+rbx*1+0x0]
4004f2: e8 db ff ff ff call 4004d2 <_ZL3addRKiS0_.isra.0>
4004f7: 01 c3 add ebx,eax
4004f9: ff cd dec ebp
4004fb: 75 ec jne 4004e9 <_ZL4workii+0x13>
4004fd: 89 d8 mov eax,ebx
4004ff: 5b pop rbx
400500: 5d pop rbp
400501: 41 5c pop r12
400503: 41 5d pop r13
400505: c3 ret
Assembly from gcc-4.7.3 -Os (executes in 0.822 secs):
00000000004004fa <_ZL3addRKiS0_.isra.0>:
4004fa: 8d 04 37 lea eax,[rdi+rsi*1]
4004fd: c3 ret
00000000004004fe <_ZL4workii>:
4004fe: 41 55 push r13
400500: 41 89 f5 mov r13d,esi
400503: 41 54 push r12
400505: 41 89 fc mov r12d,edi
400508: 55 push rbp
400509: bd 00 c2 eb 0b mov ebp,0xbebc200
40050e: 53 push rbx
40050f: 31 db xor ebx,ebx
400511: 41 8d 74 1d 00 lea esi,[r13+rbx*1+0x0]
400516: 41 8d 3c 1c lea edi,[r12+rbx*1]
40051a: e8 db ff ff ff call 4004fa <_ZL3addRKiS0_.isra.0>
40051f: 01 c3 add ebx,eax
400521: ff cd dec ebp
400523: 75 ec jne 400511 <_ZL4workii+0x13>
400525: 89 d8 mov eax,ebx
400527: 5b pop rbx
400528: 5d pop rbp
400529: 41 5c pop r12
40052b: 41 5d pop r13
40052d: c3 ret
Assembly from gcc-4.8.1 -Os (executes in 0.994 secs):
00000000004004fd <_ZL3addRKiS0_.isra.0>:
4004fd: 8d 04 37 lea eax,[rdi+rsi*1]
400500: c3 ret
0000000000400501 <_ZL4workii>:
400501: 41 55 push r13
400503: 41 89 f5 mov r13d,esi
400506: 41 54 push r12
400508: 41 89 fc mov r12d,edi
40050b: 55 push rbp
40050c: bd 00 c2 eb 0b mov ebp,0xbebc200
400511: 53 push rbx
400512: 31 db xor ebx,ebx
400514: 41 8d 74 1d 00 lea esi,[r13+rbx*1+0x0]
400519: 41 8d 3c 1c lea edi,[r12+rbx*1]
40051d: e8 db ff ff ff call 4004fd <_ZL3addRKiS0_.isra.0>
400522: 01 c3 add ebx,eax
400524: ff cd dec ebp
400526: 75 ec jne 400514 <_ZL4workii+0x13>
400528: 89 d8 mov eax,ebx
40052a: 5b pop rbx
40052b: 5d pop rbp
40052c: 41 5c pop r12
40052e: 41 5d pop r13
400530: c3 ret
My colleague helped me find a plausible answer to my question. He noticed the importance of the 256-byte boundary. He is not registered here and encouraged me to post the answer myself (and take all the fame).
Short answer:
Is it the padding that is the culprit in this case? Why and how?
It all boils down to alignment. Alignment can have a significant impact on performance; that is why we have the -falign-* flags in the first place.
I have submitted a (bogus?) bug report to the gcc developers. It turns out that the default behavior is "we align loops to 8 byte by default but try to align it to 16 byte if we don't need to fill in over 10 bytes." Apparently, this default is not the best choice in this particular case and on my machine. Clang 3.4 (trunk) with -O3 does the appropriate alignment and the generated code does not show this weird behavior.
Of course, if an inappropriate alignment is done, it makes things worse. An unnecessary / bad alignment just eats up bytes for no reason and potentially increases cache misses, etc. The noise it creates pretty much makes timing micro-optimizations impossible.
How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source codes?
Simply by telling gcc to do the right alignment:
g++ -O2 -falign-functions=16 -falign-loops=16
Long answer:
The code will run slower if:

- an XX byte boundary cuts add() in the middle (XX being machine dependent);
- the call to add() has to jump over an XX byte boundary and the target is not aligned;
- add() is not aligned;
- the loop is not aligned.
The first two are beautifully visible in the code and results that Marat Dukhan kindly posted. In this case, gcc-4.8.1 -Os (executes in 0.994 secs):
00000000004004fd <_ZL3addRKiS0_.isra.0>:
4004fd: 8d 04 37 lea eax,[rdi+rsi*1]
400500: c3 ret
A 256-byte boundary cuts add() right in the middle, and neither add() nor the loop is aligned. Surprise, surprise, this is the slowest case!
In the case of gcc-4.7.3 -Os (executes in 0.822 secs), the 256-byte boundary only cuts into a cold section (but neither the loop nor add() is cut):
00000000004004fa <_ZL3addRKiS0_.isra.0>:
4004fa: 8d 04 37 lea eax,[rdi+rsi*1]
4004fd: c3 ret
[...]
40051a: e8 db ff ff ff call 4004fa <_ZL3addRKiS0_.isra.0>
Nothing is aligned, and the call to add() has to jump over the 256-byte boundary. This code is the second slowest.
In the case of gcc-4.6.4 -Os (executes in 0.709 secs), although nothing is aligned, the call to add() doesn't have to jump over the 256-byte boundary and the target is exactly 32 bytes away:
4004f2: e8 db ff ff ff call 4004d2 <_ZL3addRKiS0_.isra.0>
4004f7: 01 c3 add ebx,eax
4004f9: ff cd dec ebp
4004fb: 75 ec jne 4004e9 <_ZL4workii+0x13>
This is the fastest of all three. Why the 256-byte boundary is special on his machine, I will leave it up to him to figure out. I don't have such a processor.
Now, on my machine I don't get this 256-byte boundary effect. Only the function and loop alignment kick in on my machine. If I pass g++ -O2 -falign-functions=16 -falign-loops=16, then everything is back to normal: I always get the fastest case and the time isn't sensitive to the -fno-omit-frame-pointer flag anymore. I can pass g++ -O2 -falign-functions=32 -falign-loops=32 or any multiple of 16, and the code is not sensitive to that either.
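If you want to check where the hot function actually lands (a sketch; the file names follow the example below and the output binary name is a placeholder), the low bits of its start address in the disassembly tell you its alignment:

g++ -O2 -falign-functions=16 -falign-loops=16 add.cpp main.cpp -o test
objdump -d test | grep 'add.*>:'    # a start address ending in 0 (hex) means 16-byte alignment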
I first noticed in 2009 that gcc (at least on my projects and on my machines) has the tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.
A likely explanation is that I had hotspots which were sensitive to alignment, just like the one in this example. By messing with the flags (passing -Os instead of -O2), those hotspots were aligned in a lucky way by accident and the code became faster. It had nothing to do with optimizing for size: it was by sheer accident that the hotspots got aligned better. From now on, I will check the effects of alignment on my projects.
Oh, and one more thing. How can such hotspots arise, like the one shown in the example? How can the inlining of a function as tiny as add() fail?
Consider this:
// add.cpp
int add(const int& x, const int& y) {
    return x + y;
}
and in a separate file:
// main.cpp
int add(const int& x, const int& y);

const int LOOP_BOUND = 200000000;

__attribute__((noinline))
static int work(int xval, int yval) {
    int sum(0);
    for (int i=0; i<LOOP_BOUND; ++i) {
        int x(xval+sum);
        int y(yval+sum);
        int z = add(x, y);
        sum += z;
    }
    return sum;
}

int main(int, char* argv[]) {
    int result = work(*argv[1], *argv[2]);
    return result;
}
and compiled as: g++ -O2 add.cpp main.cpp.
gcc won't inline add()!
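One way to convince yourself of this (a sketch using the two files above) is to check whether a call to add() survives in the binary:

g++ -O2 add.cpp main.cpp -o test
objdump -d test | grep 'call.*add'    # a remaining call instruction means add() was not inlined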
That's all, it's that easy to unintentionally create hotspots like the one in the OP. Of course it is partly my fault: gcc is an excellent compiler. If I compile the above as g++ -O2 -flto add.cpp main.cpp, that is, if I perform link time optimization, the code runs in 0.19s!
(Inlining is artificially disabled in the OP; hence, the code in the OP was 2x slower.)