> I'm trying to implement an algorithm in assembly (MASM64, Windows, x64) using jump tables. Basic idea is: there are 3 different types of operations I need to do with data. [...]
RIP-relative addressing only works when there are no other registers in the addressing mode. `[table + rcx*8]` can only be encoded in x86-64 machine code as `[disp32 + rcx*8]`, and thus only works with non-large addresses that fit in a 32-bit signed absolute address. Windows can apparently support this with `/LARGEADDRESSAWARE:NO`, like compiling with `-no-pie` on Linux to solve the same problem.

MacOS has no workaround for it: you can't use 32-bit absolute addressing at all there, because the 64-bit Mach-O format does not support 32-bit absolute addresses. NASM Accessing Array shows how to index a static array using a RIP-relative `lea` to get the table address into a register while avoiding 32-bit absolute addresses.
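To make that concrete, here's a minimal sketch (using `jumpTable1` as a stand-in for your table name): the first form needs the table's absolute address to fit in a sign-extended 32-bit displacement, while the second only needs RIP-relative addressing plus a register.

```asm
; Only encodable as [disp32 + rcx*8]: needs a 32-bit absolute address for the table
; (i.e. /LARGEADDRESSAWARE:NO on Windows, -no-pie on Linux, impossible on MacOS)
    mov   rax, [jumpTable1 + rcx*8]

; Position-independent alternative: RIP-relative LEA, then base+index addressing
    lea   rax, [jumpTable1]          ; RIP-relative, no absolute address needed
    mov   rax, [rax + rcx*8]
```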
Your jump tables themselves are fine: they use 64-bit absolute addresses, which can be relocated anywhere in virtual address space (using load-time fixups after ASLR).
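For reference, the tables might be declared something like this in MASM64 (three entries per table is just an assumption based on your question; as with the code below, I think this is valid syntax, but the layout is the point). Each `dq` entry gets a base relocation, so the loader can patch all of them wherever the image ends up:

```asm
.data
align 8
jumpTable1  dq  func_11, func_12, func_13   ; 64-bit absolute addresses, patched at load time
jumpTable2  dq  func_21, func_22, func_23   ; (you may need OFFSET here depending on your assembler)
jumpTable3  dq  func_31, func_32, func_33
```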
I think you have one too many levels of indirection. Since you already load a function pointer into a register, you should be using `jmp r10`, not `jmp [r10]`. Doing all the loads into registers up front gets them into the pipeline sooner, before any possible branch mispredicts, so it's maybe a good idea if you have lots of registers to spare.
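To spell out the distinction (a minimal sketch, not your exact code):

```asm
    mov   r10, [rax + rcx*8]    ; r10 = function pointer loaded from the table
    jmp   r10                   ; jump to the address held in r10 -- one level of indirection
  ; jmp   qword ptr [r10]       ; would load *another* pointer from memory at [r10]
  ;                             ; and jump to that -- two levels, not what you want here
```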
Much better would be inlining some of the later blocks, if they're small, because the blocks reachable by any given RCX value aren't reachable any other way. So inline all of `func_21` and `func_31` into `func_11`, and so on for `func_12`. You might use assembler macros to make this easier (see the sketch below).
Actually, what matters is just that the jump at the end of `func_11` always goes to `func_21`. It's fine if there are other ways to reach that block, e.g. from other indirect branches that skip table 1. That's no reason for `func_11` not to fall into it; it only limits what optimizations you can make between those 2 blocks, if `func_21` still has to be a valid entry point for execution paths that didn't fall through from `func_11`.
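One way to set that up with macros (hypothetical macro names, just a sketch of the idea): keep each block's body in a macro so you can expand it inline at the end of the previous stage, and still emit a standalone copy with its own label for paths that dispatch into it directly.

```asm
; Each stage's body lives in a macro so it can be expanded wherever it's needed.
BODY_21 MACRO
    ; ... work for stage 2, variant 1 ...
ENDM

BODY_31 MACRO
    ; ... work for stage 3, variant 1 ...
ENDM

func_11:
    ; ... work for stage 1, variant 1 ...
    BODY_21             ; inlined: falls straight through, no dispatch via table 2
    BODY_31             ; inlined: no dispatch via table 3 either
    ret                 ; or however your chain ends

align 8
func_21:                ; still a valid entry point for paths that skip table 1
    BODY_21
    BODY_31
    ret
```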
But anyway, you can implement your code like this. If you do optimize it, you can remove the later dispatching steps and the corresponding loads. I think this is valid MASM syntax; if not, it should still be clear what the desired machine code is.
```asm
    lea   rax, [jumpTable1]   ; RIP-relative by default in MASM, like GAS [RIP + jumpTable1] or NASM [rel jumpTable1]
    ; The other tables are at assemble-time-constant small offsets from RAX
    mov   r10, [rax + rcx*8 + (jumpTable3 - jumpTable1)]
    mov   r11, [rax + rcx*8 + (jumpTable2 - jumpTable1)]
    jmp   qword ptr [rax + rcx*8]

func_11:
    ...
    jmp   r11        ; TODO: inline func_21 or at least use  jmp func_21
                     ; you can use macros to help with either of those
```
Or if you only want to tie up a single register for one table, maybe use:
```asm
    lea   r10, [jumpTable1]     ; RIP-relative LEA
    lea   r10, [r10 + rcx*8]    ; address of the function pointer we want
    jmp   qword ptr [r10]

align 8
func_11:
    ...
    jmp   qword ptr [r10 + (jumpTable2 - jumpTable1)]   ; same index in another table

align 8
func_21:
    ...
    jmp   qword ptr [r10 + (jumpTable3 - jumpTable1)]   ; same index in *another* table
```
This takes full advantage of the known static offsets between tables.
## Cache locality for the jump targets
In your matrix of jump targets, any single usage strides down a "column" to follow some chain of jumps. It would obviously be better to transpose your layout so that one chain of jumps goes along a "row", so all the targets come from the same cache line.
i.e. arrange your tables so `func_11` and `func_21` can end with `jmp [r10+8]` and then `jmp [r10+16]`, instead of adding some offset between tables, for improved spatial locality. L1d load latency is only a few cycles, so there's not much extra delay for the CPU in checking the correctness of branch prediction, vs. if you'd loaded into registers ahead of the first indirect branch. (I'm considering the case where the first branch mispredicts, so OoO exec can't "see" the memory-indirect `jmp` until after the correct path for that starts to issue.)
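A sketch of that transposed layout (the `dispatchRows` name and the 3-stage, 32-byte-row details are assumptions, not from your code): one row per RCX value holds that value's whole chain, each row is padded to 32 bytes so it never splits across a 64-byte line, and the later jumps are just small constant offsets from `r10`.

```asm
.data
align 64
; One row per RCX value: all three stages for that value are contiguous.
dispatchRows  dq  func_11, func_21, func_31, 0   ; row 0 (padded to 32 bytes)
              dq  func_12, func_22, func_32, 0   ; row 1
              ; ...

.code
    lea   r10, [dispatchRows]        ; RIP-relative
    mov   rax, rcx
    shl   rax, 5                     ; 32 bytes per row
    add   r10, rax                   ; r10 = this RCX value's row
    jmp   qword ptr [r10]            ; stage 1

func_11:
    ; ...
    jmp   qword ptr [r10 + 8]        ; stage 2: same cache line as the first entry

func_21:
    ; ...
    jmp   qword ptr [r10 + 16]       ; stage 3
```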
You can also store 32-bit (or 16 or 8-bit) offsets relative to some reference address that's near the jump targets, or relative to the table itself.
For example, look at what GCC does when compiling `switch` jump tables in position-independent code, even for targets that do allow runtime fixups of absolute addresses. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 includes a testcase; see it on Godbolt with GCC's MASM-style `.intel_syntax`. It uses a `movsxd` load from the table, then `add rax, rdx` / `jmp rax`. The table entries are things like `dd L27 - L4` and `dd L25 - L4` (where those are label names, giving the distance from a jump target to the "anchor" L4).
(Also related for that case https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585).
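Translated to this case, that compact-table idea might look something like the following (a sketch: `func_13` and the exact syntax for label-difference initializers are assumptions, but the machine code is the point). The table lives in the code section and holds 32-bit offsets relative to the table itself, so there are no 64-bit entries and no load-time fixups:

```asm
.code
align 4
jumpTable1  dd  func_11 - jumpTable1     ; 32-bit offsets from the table (the "anchor")
            dd  func_12 - jumpTable1
            dd  func_13 - jumpTable1

dispatch:
    lea     rax, [jumpTable1]            ; RIP-relative: table address = anchor
    movsxd  rdx, dword ptr [rax + rcx*4] ; sign-extend the 32-bit entry
    add     rax, rdx                     ; anchor + offset = absolute target address
    jmp     rax
```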