Code alignment in one object file is affecting the performance of a function in another object file

余生分开走 2020-12-07 03:43

I'm familiar with data alignment and performance but I'm rather new to aligning code. I started programming in x86-64 assembly recently with NASM and have been comparing

2 Answers
  • 2020-12-07 04:12

    Ahhh, code alignment...

    Some basics of code alignment:

    • Most Intel architectures fetch 16 bytes of instructions per clock.
    • The branch predictor has a larger window and typically looks at double that per clock. The idea is to get ahead of the instructions being fetched.
    • How your code is aligned dictates which instructions are available to decode and predict in any given clock (a simple code locality argument).
    • Most modern Intel architectures cache instructions at various levels (either at the macro-instruction level before decoding, or at the micro-instruction level after decoding). This eliminates the effects of code alignment, as long as you are executing out of the micro/macro cache.
    • Also, most modern Intel architectures have some form of loop stream detector that detects loops and, again, executes them out of some cache that bypasses the front-end fetch mechanism.
    • Some Intel architectures are finicky about what they can cache and what they can't. There are often dependencies on the number of instructions, uops, alignment, branches, etc. Alignment may, in some cases, affect what's cached and what's not, and you can create cases where padding prevents or causes a loop to get cached.
    • To make things even more complicated, the addresses of instructions are also used by the branch predictor. They are used in several ways, including (1) as a lookup into a branch prediction buffer to predict branches, (2) as a key/value to maintain some form of global state of branch behavior for prediction purposes, (3) as a key for determining indirect branch targets, etc. Therefore, alignment can actually have a pretty huge impact on branch prediction in some cases, due to aliasing or other poor prediction.
    • Some architectures use instruction addresses to determine when to prefetch data, and code alignment can interfere with that if just the right conditions exist.
    • Aligning loops is not always a good thing to do, depending on how the code is laid out (especially if there's control flow in the loop).
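To make the fetch-window point concrete, here is a minimal Python sketch that counts how many aligned 16-byte blocks a stretch of code overlaps. The 16-byte width comes from the first bullet; the function name, the addresses, and the 20-byte loop size are made up for illustration, and real front ends are considerably more complicated than this model.

```python
FETCH_BLOCK = 16  # assumed fetch width in bytes, taken from the first bullet above


def fetch_blocks_touched(start: int, size: int, block: int = FETCH_BLOCK) -> int:
    """Count the aligned `block`-byte windows overlapped by [start, start + size)."""
    if size <= 0:
        return 0
    first = start // block               # index of the block holding the first byte
    last = (start + size - 1) // block   # index of the block holding the last byte
    return last - first + 1


if __name__ == "__main__":
    # A hypothetical 20-byte loop body placed at two different start addresses.
    for start in (0x400000, 0x40000D):
        print(hex(start), fetch_blocks_touched(start, 20))
```

The same 20 bytes overlap two blocks when they start on a 16-byte boundary and three blocks when they start 13 bytes into one, which is the kind of difference that only shows up in the final linked addresses.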

    Having said all that blah blah, your issue could be any one of these. It's important to look at the disassembly of not just the object file, but the executable. You want to see what the final addresses are after everything is linked. Making changes in one object could affect the alignment/addresses of instructions in another object after linking.

    In some cases, it's nearly impossible to align your code in a way that maximizes performance, simply because so many low-level architectural behaviors are hard to control and predict (that doesn't necessarily mean this is always the case). In some cases, your best bet is to have a default alignment strategy (say, align all function entries on 16-byte boundaries, and outer loops the same) so that you minimize how much your performance varies from change to change. As a general strategy, aligning function entries is good. Aligning loops that are relatively small is good, as long as you're not adding nops in your execution path.

    Beyond that, I'd need more info/data to pinpoint your exact problem, but I thought some of this may help. Good luck :)

  • 2020-12-07 04:32

    The confusing nature of the effect (the assembled code doesn't change!) you are seeing is due to section alignment. When using the ALIGN macro in NASM, it actually has two separate effects:

    1. It adds zero or more nop instructions so that the next instruction is aligned to the specified power-of-two boundary.

    2. It issues an implicit SECTALIGN macro call, which sets the section alignment directive to the alignment amount [1].

    The first point is the commonly understood behavior for align. It aligns the loop relatively within the section in the output file.

    The second part is also needed however: imagine your loop was aligned to a 32 byte boundary in the assembled section, but then the runtime loader put your section, in memory, at an address aligned only to 8 bytes: this would make the in-file alignment quite pointless. To fix this, most executable formats allow each section to specify an alignment requirement, and the runtime loader/linker will be sure to load the section at a memory address which respects the requirement.

    That's what the hidden SECTALIGN macro does - it ensures that your ALIGN macro works.

    For your file, there is no difference in the assembled code between ALIGN 16 and ALIGN 32 because the next 16-byte boundary happens to also be the next 32-byte boundary (of course, every other 16-byte boundary is a 32-byte one, so that happens about half the time). The implicit SECTALIGN call is still different though, and that's the one byte difference you see in your hexdump. The 0x20 is decimal 32, and the 0x10 is decimal 16.
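The "about half the time" claim can be checked with a little arithmetic. The Python sketch below is only an illustration (the helper names are hypothetical and nothing here calls NASM): it computes the padding needed to reach the next boundary and counts the offsets within one 32-byte period for which ALIGN 16 and ALIGN 32 would insert the same number of bytes.

```python
def padding_to_align(offset: int, alignment: int) -> int:
    """Bytes of padding needed to bring `offset` up to a multiple of `alignment`."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    return -offset % alignment


def same_padding(offset: int, a: int = 16, b: int = 32) -> bool:
    """True when aligning to `a` or to `b` inserts the same number of bytes."""
    return padding_to_align(offset, a) == padding_to_align(offset, b)


if __name__ == "__main__":
    # Offsets repeat with period 32, so one period covers every case.
    matches = [off for off in range(32) if same_padding(off)]
    print(len(matches), "of 32 offsets give identical padding")
```

For those offsets the assembled bytes are identical, and only the recorded section alignment differs.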

    You can verify this with objdump -h <binary>. Here's an example on a binary I aligned to 32 bytes:

    objdump -h loop-test.o
    
    loop-test.o:     file format elf64-x86-64
    
    Sections:
    Idx Name          Size      VMA               LMA               File off  Algn
      0 .text         0000d18a  0000000000000000  0000000000000000  00000180  2**5
                      CONTENTS, ALLOC, LOAD, READONLY, CODE
    

    The 2**5 in the Algn column is the 32-byte alignment. With 16-byte alignment this changes to 2**4.
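If you want to compare the alignment of several object files, the Algn column can be read mechanically. This is a rough Python sketch that parses text shaped like the objdump -h listing above. The regular expression and the function name are my own assumptions about that layout, not an official parser, and the sample text is simply the listing from this answer.

```python
import re

# Sample text copied from the objdump -h listing above.
SAMPLE = """\
Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         0000d18a  0000000000000000  0000000000000000  00000180  2**5
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
"""

# Assumed row shape: index, section name, other columns, then "2**N" at the end.
ROW = re.compile(r"^\s*\d+\s+(\S+)\s+.*\s2\*\*(\d+)\s*$")


def section_alignments(objdump_h_output: str) -> dict:
    """Map each section name to its alignment in bytes."""
    result = {}
    for line in objdump_h_output.splitlines():
        m = ROW.match(line)
        if m:
            result[m.group(1)] = 1 << int(m.group(2))
    return result


if __name__ == "__main__":
    print(section_alignments(SAMPLE))
```

In practice you would feed it the captured output of objdump -h for each object file and compare the values for .text.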

    Now it should be clear what happens - aligning the first function in your example changes the section alignment, but not the assembled code. When you link your program, the linker merges the various .text sections and picks the highest alignment.

    At runtime, this causes the code to be aligned to a 32-byte boundary - but that doesn't affect the first function, because it isn't alignment-sensitive. Since the linker has merged your object files into one section, the larger alignment of 32 changes the alignment of every function (and instruction) in the section, including your other function, and so it changes the performance of that other function, which is alignment-sensitive.
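Here is a deliberately simplified Python model of that merging step. It assumes the linker places input sections one after another, rounds each one up to its own alignment, and gives the merged section the largest alignment. Real linkers follow a linker script and do more than this, and the object names, sizes, and base address are invented for the example.

```python
def align_up(addr: int, alignment: int) -> int:
    """Round `addr` up to the next multiple of a power-of-two `alignment`."""
    return (addr + alignment - 1) & ~(alignment - 1)


def link_text(base: int, sections):
    """Lay out (name, size, alignment) tuples in order.

    Returns (merged_alignment, {name: start_address}).
    """
    merged_alignment = max(alignment for _, _, alignment in sections)
    addr = align_up(base, merged_alignment)
    placed = {}
    for name, size, alignment in sections:
        addr = align_up(addr, alignment)  # honor each input section's request
        placed[name] = addr
        addr += size
    return merged_alignment, placed


if __name__ == "__main__":
    # Hypothetical layout: startup code, then a.o, then the alignment-sensitive b.o.
    for a_align in (16, 32):
        sections = [("startup", 0x2A, 16), ("a.o", 0x50, a_align), ("b.o", 0x40, 16)]
        merged, placed = link_text(0x401000, sections)
        print(a_align, merged, hex(placed["b.o"]), placed["b.o"] % 32)
```

In this model, b.o itself never changes, yet its start address moves by 16 bytes when a.o asks for 32-byte alignment, so its position relative to 32-byte boundaries changes too.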


    [1] To be precise, SECTALIGN only changes the section alignment if the current section alignment is less than the specified amount - so the final section alignment will be the same as the largest SECTALIGN directive in the section.
