[question in bold at the bottom]
When an assembler generates a binary encoding it needs to decide whether to make each branch long or short, short being better, if possible. This is the branch displacement optimization (BDO) problem.
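(As a reference point for the sizes involved: a conditional branch is 2 bytes in its short rel8 form and 6 bytes in its near rel32 form. A minimal Python model of that size rule; jcc_size is a made-up helper name, not any assembler's API:)

    def jcc_size(disp, force_long=False):
        # x86 Jcc: short form = opcode + rel8      -> 2 bytes, reach -128..+127
        #          near form  = 0F opcode + rel32  -> 6 bytes
        # disp is measured from the end of the jump instruction.
        if not force_long and -128 <= disp <= 127:
            return 2
        return 6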
j_random_hacker's argument that Start Small is optimal for the simplified case where there is no padding sounds reasonable. However, it's not very useful outside optimized-for-size functions. Real asm does have ALIGN directives, and it does make a difference.
Here's the simplest example I could construct of a case where Start Small doesn't give an optimal result (tested with NASM and YASM). Use jz near .target0 to force a long encoding, moving another_function 32 bytes earlier and reducing padding within func.
func:
.target0: ; anywhere nearby
jz .target0 ; (B0) short encoding is easily possible
.target1:
times 10 vpermilps xmm14, xmm15, [rdi+12345]
; A long B0 doesn't push this past a 32B boundary, so short or long B0 doesn't matter
ALIGN 32
.loop:
times 12 xor r15d,r15d
jz .target1 ; (B1) short encoding only possible if B0 is long
times 18 xor r15d,r15d
ret ; A long B1 does push this just past a 32B boundary.
ALIGN 32
another_function:
xor eax,eax
ret
If B0 is short, then B1 has to be long to reach .target1.
If B0 is long, it pushes .target1 closer to B1, allowing a short encoding to reach it.
So at most one of B0 and B1 can have a short encoding, but it matters which one is short. A short B0 means 32 more bytes of alignment padding, with no saving in code-size. A long B0 allowing a short B1 does save total code size. In my example, I've illustrated the simplest way that can happen: by pushing the end of the code after B1 past the boundary of the next alignment. It could also affect other branches, e.g. requiring a long encoding for a branch to .loop.
Start-Small result: B0 short, B1 long. (Their initial first-pass states.) Start-Small doesn't try lengthening B0 and shortening B1 to see if it reduces total padding, or just padding that gets executed (ideally weighted by trip count).
The result costs a 4-byte NOP before .loop, and 31 bytes of NOPs before another_function, so it starts at 0x400160 instead of the 0x400140 we get from using jz near .target0, which leads to a short encoding for B1.
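To make that arithmetic checkable, here's a Python walk-through of the example's layout. The per-instruction sizes are what NASM emits here (9 bytes per vpermilps with its disp32, 3 bytes per xor r15d,r15d), and func is assumed to start at 0x400080, the default ld text start, which is consistent with the addresses above:

    def align_to(addr, boundary):
        return (addr + boundary - 1) // boundary * boundary

    def layout(b0_long):
        base = 0x400080                 # assumed start of func (default ld layout)
        target1 = base + (6 if b0_long else 2)   # .target0 and jz .target0 (B0)
        end_vperm = target1 + 10 * 9    # ten 9-byte vpermilps
        loop = align_to(end_vperm, 32)  # ALIGN 32 before .loop
        end_xor = loop + 12 * 3         # twelve 3-byte xor r15d,r15d
        disp_b1 = target1 - (end_xor + 2)        # rel8 counts from the insn's end
        b1 = 2 if -128 <= disp_b1 <= 127 else 6  # B1: short only if in rel8 range
        end_func = end_xor + b1 + 18 * 3 + 1     # B1, 18 x xor, ret
        another = align_to(end_func, 32)         # ALIGN 32 before another_function
        return loop - end_vperm, b1, another - end_func, another

    for b0_long in (False, True):
        pad_loop, b1, pad_end, another = layout(b0_long)
        print(b0_long, pad_loop, b1, pad_end, hex(another))
    # False 4 6 31 0x400160   <- Start-Small: short B0 forces B1 long
    # True  0 2 3  0x400140   <- jz near .target0: 32 bytes smaller overall

With B0 short, B1's displacement comes out to -132 (out of rel8 range); with B0 long it is exactly -128, which just fits.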
Note that a long encoding for B0 itself is not the only way to achieve a short encoding for B1. A longer-than-necessary encoding for any of the instructions before .target1 could also do the trick (e.g. a 4B displacement or immediate instead of a 1B one, or an unnecessary or repeated prefix).
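At the byte level, such "pad by re-encoding" alternatives look like this (the encodings are standard x86; the Python table is just for illustration):

    # (nasm text, encoding, note) - same architectural effect, different length
    alternatives = [
        ("add eax, 1",       "83 C0 01",          "imm8 form (83 /0 ib)"),
        ("add eax, 1",       "81 C0 01 00 00 00", "imm32 form: 3 bytes longer"),
        ("mov eax, [rdi+8]", "8B 47 08",          "disp8 addressing"),
        ("mov eax, [rdi+8]", "8B 87 08 00 00 00", "disp32: 3 bytes longer"),
        ("xor eax, eax",     "31 C0",             "plain"),
        ("xor eax, eax",     "3E 31 C0",          "redundant DS prefix: +1 byte"),
    ]
    for text, enc, note in alternatives:
        print(f"{text:18} {enc:20} {len(enc.split())}B  {note}")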
Unfortunately, no assembler I know of supports padding this way; they only pad with nop. (See: What methods can be used to efficiently extend instruction length on modern x86?)
Often there isn't even a jump over the long NOP at the start of a loop, so more padding is potentially worse for performance: either multiple NOPs are needed, or the code runs on a CPU like Atom or Silvermont that is really slow with lots of prefixes, which the assembler used because it wasn't tuning for Silvermont.
Note that compiler output rarely has jumps between functions (usually just for tail-call optimization). x86 doesn't have a short encoding for call. Hand-written asm can do whatever it wants, but spaghetti code is (hopefully?) still uncommon on a large scale.
I think it's likely that the BDO problem can be broken into multiple independent sub-problems for most asm source files, usually each function being a separate problem. This means that even non-polynomial-complexity algorithms may be viable.
Some shortcuts will help break up the problem: e.g. detecting when a long encoding is definitely needed, even if all intervening branches use a short encoding, as sketched below. This allows breaking dependencies between sub-problems when the only thing connecting them was a tail-call between two distant functions.
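A sketch of that shortcut in Python, under a simplified model where a function body is a list of fixed-size byte blobs, labels, conditional branches, and ALIGN directives (the model and all names are made up for illustration). Measure every displacement in the minimum-length layout (all branches short, all padding zero); sizes only grow from there, so any branch that already misses the rel8 range is long in every feasible solution:

    SHORT, LONG = 2, 6   # x86 Jcc sizes: short (rel8) vs near (rel32)

    def definitely_long(items):
        # items: ('bytes', n) | ('label', name) | ('jcc', target) | ('align', n)
        # First pass: offsets in the minimum-length layout.
        pos, offs, where = 0, [], {}
        for kind, arg in items:
            offs.append(pos)
            if kind == 'label':
                where[arg] = pos
            elif kind == 'bytes':
                pos += arg
            elif kind == 'jcc':
                pos += SHORT          # assume short
            # 'align': minimum padding is 0 bytes, so pos is unchanged
        # Second pass: any branch out of rel8 range here is ALWAYS long.
        need_long = set()
        for i, (kind, arg) in enumerate(items):
            if kind == 'jcc':
                disp = where[arg] - (offs[i] + SHORT)
                if not -128 <= disp <= 127:
                    need_long.add(i)
        return need_long

Branches this finds can be committed to long encodings up front and removed from the search entirely, which is what lets a tail-call between two distant functions stop coupling their sub-problems.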
I'm not sure where to actually start making an algorithm to find a globally optimal solution. If we're willing to consider expanding other instructions to move branch targets, the search space is quite huge. However, I think we only have to consider branches that cross alignment-padding.
The possible cases are:
Doing a good job with this might be easier if we embed some knowledge of microarchitectural optimization into the assembler: e.g. always try to have branch targets start near the start of 16B insn fetch blocks, and definitely not right at the end. An Intel uop cache line can only cache uops from within one 32B block, so 32B boundaries are important for the uop cache. L1 I$ line size is 64B, and the page size is 4kiB. (The assembler won't know which code is hot and which is cold, though. Having hot code span two pages might be worse than slightly larger code-size.)
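If an assembler did embed that kind of knowledge, the objective would stop being pure code size and become a weighted cost. A toy Python sketch of a placement penalty for branch targets (the 16B/32B/64B boundaries are real; the weights are invented):

    def target_placement_cost(addr):
        cost = 0.0
        if addr % 16 >= 13:    # target at the very end of a 16B fetch block: bad
            cost += 4.0
        if addr % 32 >= 27:    # first insn likely spans a 32B uop-cache window
            cost += 1.0
        if addr % 64 >= 59:    # likely spans a 64B I-cache line
            cost += 0.5
        return cost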
Having a multi-uop instruction at the beginning of an instruction decode group is also much better than having it anywhere else, for Intel and AMD. (Less so for Intel CPUs with a uop cache). Figuring out which path the CPU will take through the code most of the time, and where the instruction decode boundaries will be, is probably well beyond what an assembler can manage.
Here's a proof that, in the absence of the anomalous jumps mentioned by harold in the comments, the "start small" algorithm is optimal:
First, let's establish that "start small" always produces a feasible solution -- that is, one that doesn't contain any short encoding of a too-long jump. The algorithm essentially amounts to repeatedly asking the question "Is it feasible yet?" and lengthening some jump encoding if not, so clearly if it terminates, then the solution it produces must be feasible. Since each iteration lengthens some jump, and no jump is ever lengthened more than once, this algorithm must eventually terminate after at most nJumps iterations, so the solution must be feasible.
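(In the simplified Python model from above, "start small" is just this fixed point; a sketch, not any real assembler's implementation:)

    def start_small(items):
        size = {i: SHORT for i, (k, _) in enumerate(items) if k == 'jcc'}
        while True:
            # Lay out with the current sizes.
            pos, offs, where = 0, [], {}
            for i, (kind, arg) in enumerate(items):
                offs.append(pos)
                if kind == 'label':
                    where[arg] = pos
                elif kind == 'bytes':
                    pos += arg
                elif kind == 'align':
                    pos = (pos + arg - 1) // arg * arg
                elif kind == 'jcc':
                    pos += size[i]
            # "Is it feasible yet?" -- lengthen anything out of rel8 range.
            changed = False
            for i, (kind, arg) in enumerate(items):
                if kind == 'jcc' and size[i] == SHORT:
                    if not -128 <= where[arg] - (offs[i] + SHORT) <= 127:
                        size[i] = LONG
                        changed = True
            if not changed:
                return size    # feasible: every remaining short branch is in range

On a model of the example function above, this converges to B0 short and B1 long, matching the suboptimal result described earlier (padding doesn't enter the algorithm's decisions at all).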
Now suppose to the contrary that the algorithm can produce a suboptimal solution X. Let Y be some optimal solution. We can represent a solution as the subset of jump instructions that are lengthened. We know that |X \ Y| >= 1 -- that is, there is at least one instruction lengthening in X that is not also in Y -- because otherwise X would be a subset of Y; a feasible proper subset of Y would be strictly smaller than Y, contradicting Y's optimality, so it would follow that X = Y, meaning that X would itself be an optimal solution, which would contradict our original assumption about X.
From among the instructions in X \ Y, choose i to be the one that was lengthened first by the "start small" algorithm, and let Z be the subset of Y (and of X) consisting of all instructions already lengthened by the algorithm prior to this time. Since the "start small" algorithm decided to lengthen i's encoding, it must have been the case that by that point in time (i.e., after lengthening all the instructions in Z), i's jump displacement was too big for a short encoding. (Note that while some of the lengthenings in Z may have pushed i's jump displacement past the critical point, this is by no means necessary -- maybe i's displacement was above the threshold from the beginning. All we can know, and all we need to know, is that i's jump displacement was above the threshold by the time Z had finished being processed.) But now look back at the optimal solution Y, and note that none of the other lengthenings in Y -- i.e., in Y \ Z -- are able to reduce i's jump displacement back down, so, since i's displacement is above the threshold but its encoding is not lengthened by Y, Y is not even feasible! An infeasible solution cannot be optimal, so the existence of such a non-lengthened instruction i in Y would contradict the assumption that Y is optimal -- meaning that no such i can exist.
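The claim is also mechanically checkable in the padding-free model: brute-force all 2^nJumps length assignments and compare total size with start_small. A sketch reusing the helpers above (exponential, so only for tiny test cases):

    from itertools import product

    def total_size(items, size):
        return sum(arg if k == 'bytes' else size[i]
                   for i, (k, arg) in enumerate(items) if k in ('bytes', 'jcc'))

    def feasible(items, size):
        pos, offs, where = 0, [], {}
        for i, (kind, arg) in enumerate(items):
            offs.append(pos)
            if kind == 'label':
                where[arg] = pos
            elif kind == 'bytes':
                pos += arg
            elif kind == 'jcc':
                pos += size[i]
        return all(-128 <= where[arg] - (offs[i] + SHORT) <= 127
                   for i, (kind, arg) in enumerate(items)
                   if kind == 'jcc' and size[i] == SHORT)

    def brute_force(items):
        jumps = [i for i, (k, _) in enumerate(items) if k == 'jcc']
        candidates = (dict(zip(jumps, combo))
                      for combo in product((SHORT, LONG), repeat=len(jumps)))
        return min((s for s in candidates if feasible(items, s)),
                   key=lambda s: total_size(items, s))

    # With no 'align' items, total_size(items, start_small(items)) always
    # equals total_size(items, brute_force(items)), as the proof above predicts.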