Why does loop alignment on 32 byte make code faster?

点点圈 posted on 2019-12-03 11:07:30

This doesn't answer point 2 (return a == d || b == d || c == d; being the same speed as return false). That's still a maybe-interesting question, since that must compile to multiple uop-cache lines of instructions.


The fact that 32-aligned version is faster, is strange to me, because [Intel's manual says to align to 16]

That optimization-guide advice is a very general guideline, and definitely doesn't mean that larger never helps. Usually it doesn't, and padding to 32 would be more likely to hurt than help. (I-cache misses, ITLB misses, and more code bytes to load from disk).

In fact, 16B alignment is rarely necessary, especially on CPUs with a uop cache. For a small loop that can run from the loop buffer, alignment is usually totally irrelevant.


16B is still not bad as a broad recommendation, but it doesn't tell you everything you need to know to understand one specific case on a couple of specific CPUs.

Compilers usually default to aligning loop branches and function entry-points, but usually don't align other branch targets. The cost of executing a NOP (and code bloat) is often larger than the likely cost of an unaligned non-loop branch target.
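If you want to experiment with alignment rather than rely on the compiler's defaults, here is a minimal sketch assuming GCC or Clang. The aligned function attribute is a compiler extension, and sum_to and entry_offset are hypothetical names made up for this illustration. GCC also has -falign-functions=N and -falign-loops=N to change the defaults for a whole file. Note that the attribute only aligns the function entry point, not the loop inside it.

```cpp
#include <cstdint>

// GCC/Clang extension (an assumption of this sketch): request at least
// 32-byte alignment for this function's entry point. noinline keeps a
// standalone copy of the function so it has an address to inspect.
__attribute__((aligned(32), noinline))
int sum_to(int n) {
    int total = 0;
    for (int i = 0; i < n; ++i) {
        total += i;
    }
    return total;
}

// Offset of a function's entry point within an aligned block of
// `block` bytes. A result of 0 means the entry point is block-aligned.
inline std::uintptr_t entry_offset(int (*fn)(int), std::uintptr_t block) {
    return reinterpret_cast<std::uintptr_t>(fn) % block;
}
```

Calling entry_offset(&sum_to, 32) at run time should give 0 with compilers that honor the attribute. Whether the extra padding actually helps is something you have to measure on your own CPU.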


Code alignment has some direct and some indirect effects. The direct effects include the uop cache on Intel SnB-family. For example, see Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs.

Another section of Intel's optimization manual goes into some detail about how the uop cache works:

2.3.2.2 Decoded ICache:

  • All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region. (I think this means an instruction that extends past the boundary goes in the uop cache for the block containing its start, rather than end. Spanning instructions have to go somewhere, and the branch target address that would run the instruction is the start of the insn, so it's most useful to put it in a line for that block).
  • A multi micro-op instruction cannot be split across Ways.
  • An instruction which turns on the MSROM consumes an entire Way.
  • Up to two branches are allowed per Way.
  • A pair of macro-fused instructions is kept as one micro-op.

See also Agner Fog's microarch guide. He adds:

  • An unconditional jump or call always ends a μop cache line
  • lots of other stuff that probably isn't relevant here.

He also notes that if your code doesn't fit in the uop cache, it can't run from the loop buffer.


The indirect effects of alignment include:

  • larger/smaller code-size (L1I cache misses, TLB). Not relevant for your test
  • which branches alias each other in the BTB (Branch Target Buffer).

If I remove volatiles from one.cpp, code becomes slower. Why is that?

The larger instructions push the last instruction in the loop across a 32B boundary:

 59e:   83 eb 01                sub    ebx,0x1
 5a1:   75 dd                   jne    580 <main+0x20>

So if you aren't running from the loop buffer (LSD), then without volatile one of the uop-cache fetch cycles gets only 1 uop.
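The boundary arithmetic for those two instructions can be checked mechanically. This is a small sketch, not anything from the original test code: block_start and spans_boundary are hypothetical helper names, and the addresses and lengths are copied from the disassembly above (object-file offsets, not final linked addresses).

```cpp
#include <cstdint>

// Start address of the aligned block containing `addr`.
// `block` must be a power of two (32 for a uop-cache region).
constexpr std::uint64_t block_start(std::uint64_t addr, std::uint64_t block) {
    return addr & ~(block - 1);
}

// True if an instruction of `len` bytes starting at `addr` has its
// first and last byte in different aligned blocks.
constexpr bool spans_boundary(std::uint64_t addr, std::uint64_t len,
                              std::uint64_t block) {
    return block_start(addr, block) != block_start(addr + len - 1, block);
}

// Values taken from the disassembly: sub is 3 bytes, jne is 2 bytes.
constexpr std::uint64_t sub_addr = 0x59e, sub_len = 3;
constexpr std::uint64_t jne_addr = 0x5a1, jne_len = 2;

// sub starts in the 32B block at 0x580 (the loop top), jne in the next one.
static_assert(block_start(sub_addr, 32) == 0x580, "sub starts in block 0x580");
static_assert(block_start(jne_addr, 32) == 0x5a0, "jne starts in block 0x5a0");

// sub's last byte is at 0x5a0, so it straddles the 32B boundary...
static_assert(spans_boundary(sub_addr, sub_len, 32), "sub crosses 32B");
static_assert(!spans_boundary(jne_addr, jne_len, 32), "jne does not cross 32B");

// ...but at these offsets there is no 64B boundary between the two.
static_assert(!spans_boundary(sub_addr, sub_len, 64), "no 64B crossing");
```

The checks run at compile time, so compiling the file (C++11 or later) is enough to confirm them. Since sub starts before 0x5a0 and jne starts after it, the two land in different 32B regions if they don't macro-fuse.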

If sub/jne macro-fuses, this might not apply. And I think only crossing a 64B boundary would break macro-fusion.

Also, those aren't real addresses. Have you checked what the addresses are after linking? There could be a 64B boundary there after linking, if the text section has less than 64B alignment.


Sorry I haven't actually tested this to say more about this specific case. The point is, when you bottleneck on the front-end from stuff like having a call/ret inside a tight loop, alignment becomes important and can get extremely complex. Boundary-crossing or not for all future instructions is affected. Do not expect it to be simple. If you've read my other answers, you'll know I'm not usually the kind of person to say "it's too complicated to fully explain", but alignment can be that way.

See also Code alignment in one object file is affecting the performance of a function in another object file

In your case, make sure tiny functions inline. Use link-time optimization if your code-base has any important tiny functions in separate .c files instead of in a .h where they can inline. Or change your code to put them in a .h.
