Jump table implementation in MASM x64?

前端 未结 1 521
隐瞒了意图╮
隐瞒了意图╮ 2021-01-24 06:42

I\'m trying to implement an algorithm in assembly (MASM64, Windows, x64) using jump tables. Basic idea is: there are 3 different types of operations I need to do with data. Th

相关标签:
1条回答
  • 2021-01-24 07:20

    RIP-relative addressing only works when there are no other registers in the addressing mode.

    [table + rcx*8] can only be encoded in x86-64 machine code as [disp32 + rcx*8], and thus only works with non-large addresses that fit in a 32-bit signed absolute address. Windows can apparently support this with LARGEADDRESSAWARE:NO, like on Linux compiling with -no-pie to solve the same problem.

    MacOS has no workaround for it, you can't use 64-bit absolute addressing at all there. Mach-O 64-bit format does not support 32-bit absolute addresses. NASM Accessing Array shows how to index a static array using a RIP-relative lea to get the table address into a register while avoiding 32-bit absolute addresses.

    Your jump tables themselves are fine: they use 64-bit absolute addresses which can be relocated anywhere in virtual address space. (Using load-time fixups after ASLR.)


    I think you have one too many levels of indirection. Since you already load a function pointer into a register, you should be using jmp r10 not jmp [r10]. Doing all the loads into registers up front gets them in the pipeline sooner, before any possible branch mispredicts, so is maybe a good idea if you have lots of registers to spare.

    Much better would be inlining some of the later blocks, if they're small, because the blocks reachable by any given RCX value aren't reachable any other way. So it would be much better to inline all of func_21 and func_31 into func_11, and so on for func_12. You might use assembler macros to make this easier.

    Actually what matters is just that the jump at the end of func_11 always goes to func_21. It's fine of there are other ways to reach that block, e.g. from other indirect branches that skip table 1. That's no reason for func_11 not to fall into it; it only limits what optimizations you can make between those 2 blocks if func_21 still has to be a valid entry point for execution paths that didn't fall through from func_11.


    But anyway, you can implement your code like this. If you do optimize it, you can remove the later dispatching steps and the corresponding loads.

    I think this is valid MASM syntax. If not, it should still be clear what the desired machine-code is.

        lea    rax,  [jumpTable1]          ; RIP-relative by default in MASM, like GAS [RIP + jumpTable1] or NASM [rel jumpTable1]
    
        ; The other tables are at assemble-time-constant small offsets from RAX
        mov    r10,  [rax + rcx*8 + jumpTable3 - jumpTable1]
        mov    r11,  [rax + rcx*8 + jumpTable2 - jumpTable1]
        jmp    [rax + rcx*8]
    
    
    func_11:
        ...
        jmp  r10         ; TODO: inline func_21  or at least use  jmp func_21
                         ;  you can use macros to help with either of those
    

    Or if you only want to tie up a single register for one table, maybe use:

        lea    r10,  [jumpTable1]    ; RIP-relative LEA
        lea    r10,  [r10 + rcx*8]   ; address of the function pointer we want
        jmp    [r10]
    
    align 8
    func_11:
        ...
        jmp   [r10 + jumpTable2 - jumpTable1]    ; same index in another table
    
    
    align 8
    func_12:
        ...
        jmp   [r10 + jumpTable3 - jumpTable1]    ; same index in *another* table
    

    This takes full advantage of the known static offsets between tables.


    Cache locality for the jump targets

    In your matrix of jump targets, any single usage strides down a "column" to follow some chain of jumps. It would obviously be better to transpose your layout so that one chain of jumps goes along a "row", so so all the targets come from the same cache line.

    i.e. arrange your table so func_11 and 21 can end with jmp [r10+8], and then jmp [r10+16], instead of + some offset between tables, for improved spatial locality. L1d load latency is only a few cycles so there's not much extra delay for the CPU in check the correctness of branch prediction, vs. if you'd loaded into registers ahead of the first indirect branch. (I'm considering the case where the first branch mispredicts, so OoO exec can't "see" the memory-indirect jmp until after the correct path for that starts to issue.)


    Avoiding 64-bit absolute addresses:

    You can also store 32-bit (or 16 or 8-bit) offsets relative to some reference address that's near the jump targets, or relative to the table itself.

    For example, look at what GCC does when compiling switch jump tables in position-independent code, even for targets that do allow runtime fixups of absolute addresses.

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 includes a testcase; see it on Godbolt with GCC's MASM-style .intel_syntax. It uses a movsxd load from the table, then add rax, rdx / jmp rax. The table entries are things like dd L27 - L4 and dd L25 - L4 (where those are label names, giving the distance from a jump target to the "anchor" L4).

    (Also related for that case https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585).

    0 讨论(0)
提交回复
热议问题