Question
A basic block is defined as a sequence of (non-jump) instructions ending with a jump (direct or indirect) instruction. The jump target address should be the start of another basic block. Suppose I have the following assembly code:
106ac: ba00000f blt 106f0 <main+0xb8>
106b0: e3099410 movw r9, #37904 ; 0x9410
106b4: e3409001 movt r9, #1
106b8: e79f9009 ldr r9, [pc, r9]
106bc: e3a06000 mov r6, #0
106c0: e1a0a008 mov sl, r8
106c4: e30993fc movw r9, #37884 ; 0x93fc
106c8: e3409001 movt r9, #1
106cc: e79f9009 ldr r9, [pc, r9]
106d0: e5894000 str r4, [r9]
106d4: e7941105 ldr r1, [r4, r5, lsl #2]
106d8: e1a00007 mov r0, r7
106dc: e12fff31 blx r1
106e0: e0806006 add r6, r0, r6
106e4: e25aa001 subs sl, sl, #1
106e8: e287700d add r7, r7, #13
106ec: 1afffff4 bne 106c4 <main+0x8c>
106f0: e30993d0 movw r9, #37840 ; 0x93d0
106f4: e3409001 movt r9, #1
bb1
106a4: ...
106ac: ba00000f blt 106f0 <main+0xb8>
The first basic block bb1 has a target address which is the start of bb4.
bb2
106b0: e3099410 movw r9, #37904 ; 0x9410
.... All other instructions
106c4: e30993fc movw r9, #37884 ; 0x93fc
.... All other instructions
106d8: e1a00007 mov r0, r7
106dc: e12fff31 blx r1
The second basic block bb2 has an indirect branch so the target address is known only at runtime.
bb3
106e0: e0806006 add r6, r0, r6
106e4: e25aa001 subs sl, sl, #1
106e8: e287700d add r7, r7, #13
106ec: 1afffff4 bne 106c4 <main+0x8c>
The third basic block has a target address that is not the start of another basic block; it lands in the middle of bb2. According to the definition of a basic block, this should not be possible. In practice, however, I see this behavior (jumps into the middle of basic blocks) in multiple places. How can this be explained? Is it possible to force a compiler (LLVM) to generate code that only jumps to the beginning of a basic block?
bb4
106f0: e30993d0 movw r9, #37840 ; 0x93d0
106f4: e3409001 movt r9, #1
....
Ends with a branch (direct or indirect)
I am generating the basic blocks with a tool (SPEDI). The compiler is LLVM (Clang front end) and the target architecture is ARMv7 (Cortex-A9).
Answer 1:
Basic blocks are the nodes in the control flow graph, which means that once control enters the block, it can't do anything else apart from running through the whole block and exiting it. It doesn't mean that they have to start or end with a jump instruction. For better understanding refer to this excerpt from Wikipedia:
Because of its construction procedure, in a CFG, every edge A→B has the property that:
outdegree(A) > 1 or indegree(B) > 1 (or both).
The CFG can thus be obtained, at least conceptually, by starting from the program's (full) flow graph—i.e. the graph in which every node represents an individual instruction—and performing an edge contraction for every edge that falsifies the predicate above, i.e. contracting every edge whose source has a single exit and whose destination has a single entry.
According to this definition I would analyze code between 106b0 and 106ec differently: one block B1 from 106b0 to 106c0, and one block B2 from 106c4 to 106ec. This code is a loop, B1 is the setup part of the loop and B2 is the body.
In ARM, a bl or blx instruction such as the blx at 106dc is a function call: control passes to the called function and then returns to the instruction right after the call. So if we're only constructing the CFG for the calling function, I wouldn't consider this instruction a block boundary. If we're building the CFG for the whole program, there should be an edge to the called function here and another edge back from the called function to the next instruction.
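The split described above can be reproduced with the usual "leader" rule: the first instruction, every direct branch target, and every instruction following a branch each start a new block. The following is only a minimal sketch under my own assumptions. The tuple encoding of the question's listing, the names `find_leaders` and `calls_end_blocks`, and the tiny mnemonic sets are illustrative and are not taken from SPEDI or LLVM.

```python
# Instructions from the question's listing: (address, mnemonic, direct target or None).
# 0x106ac is only a leader here because the sample starts there (bb1 really begins earlier).
INSNS = [
    (0x106ac, "blt", 0x106f0),
    (0x106b0, "movw", None), (0x106b4, "movt", None), (0x106b8, "ldr", None),
    (0x106bc, "mov", None), (0x106c0, "mov", None),
    (0x106c4, "movw", None), (0x106c8, "movt", None), (0x106cc, "ldr", None),
    (0x106d0, "str", None), (0x106d4, "ldr", None), (0x106d8, "mov", None),
    (0x106dc, "blx", None),            # indirect call through r1
    (0x106e0, "add", None), (0x106e4, "subs", None), (0x106e8, "add", None),
    (0x106ec, "bne", 0x106c4),
    (0x106f0, "movw", None), (0x106f4, "movt", None),
]

BRANCHES = {"blt", "bne"}   # assumption: only the branches that occur in this listing
CALLS = {"bl", "blx"}       # treated as calls, as in the answer above

def find_leaders(insns, calls_end_blocks=False):
    """Return the sorted addresses that start a basic block."""
    addrs = [addr for addr, _, _ in insns]
    leaders = {addrs[0]}
    for i, (_, mnem, target) in enumerate(insns):
        # A direct branch target inside the listing starts a block.
        if mnem in BRANCHES and target in addrs:
            leaders.add(target)
        # The instruction after a branch (optionally after a call) starts a block.
        ends_block = mnem in BRANCHES or (calls_end_blocks and mnem in CALLS)
        if ends_block and i + 1 < len(insns):
            leaders.add(addrs[i + 1])
    return leaders and sorted(leaders)

print([hex(a) for a in find_leaders(INSNS)])
print([hex(a) for a in find_leaders(INSNS, calls_end_blocks=True)])
```

With calls not ending a block, 106c4 becomes a leader because it is the target of the bne, which gives the B1 (106b0 to 106c0) and B2 (106c4 to 106ec) split. Setting `calls_end_blocks=True` additionally starts a block at 106e0, which is the whole-program view of the blx.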
Answer 2:
A basic block doesn't contain branch targets, as Samuel's answer explains. A branch target in the middle of a run of instructions is therefore also a boundary between basic blocks.
You're generating this code with a compiler, so use clang -O3 -S foo.c to get the compiler's asm output with labels on branch targets.
Compiling all the way to an object file and then disassembling it means you need a disassembler that puts labels back onto the targets of all the branches it finds. Agner Fog's x86 disassembler, objconv, does this. Maybe there's something similar for ARM, but I don't think GNU binutils objdump -d has an option for that.
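If no such disassembler is at hand, putting labels back onto direct branch targets can be approximated by post-processing the objdump text. This is a rough sketch that assumes the line format shown in the question (address, 8-digit encoding, mnemonic, operands). The function name `add_labels` and the `.L_<address>` label scheme are made up for illustration, and indirect branches such as blx r1 are not resolved.

```python
import re

# A few lines from the question's listing (most instructions omitted for brevity).
LISTING = """\
106ac: ba00000f blt 106f0 <main+0xb8>
106b0: e3099410 movw r9, #37904 ; 0x9410
106c4: e30993fc movw r9, #37884 ; 0x93fc
106dc: e12fff31 blx r1
106ec: 1afffff4 bne 106c4 <main+0x8c>
106f0: e30993d0 movw r9, #37840 ; 0x93d0
"""

# address: encoding mnemonic operands
LINE_RE = re.compile(r"^\s*([0-9a-f]+):\s+[0-9a-f]{8}\s+(\S+)\s*(.*)$")
# objdump prints a direct target as "<hex address> <symbol+offset>"
TARGET_RE = re.compile(r"^([0-9a-f]+) <[^>]+>$")

def add_labels(listing):
    """Return the listing as lines, with a label before each direct branch target."""
    parsed, targets = [], set()
    for line in listing.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that is not an instruction line
        addr, operands = int(m.group(1), 16), m.group(3).strip()
        t = TARGET_RE.match(operands)
        if t:
            # Note: direct calls (bl <func>) match too, so call targets get labels as well.
            targets.add(int(t.group(1), 16))
        parsed.append((addr, line.strip()))
    out = []
    for addr, text in parsed:
        if addr in targets:
            out.append(f".L_{addr:x}:")
        out.append("    " + text)
    return out

print("\n".join(add_labels(LISTING)))
```

Each label marks a basic block boundary, so the line at 106c4 gets one even though it sits in the middle of what the tool reported as bb2.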
I don't have ARM clang installed, but the output is probably very similar to x86. For example, a very simple function that will compile with a branch:
int sa, sb;
void foo(int a, int b) {
    if (a > b) {
        sb = b;
    }
    sa = a;
}
Compiled for x86 on the Godbolt compiler explorer with clang 5.0 -O3 (Godbolt has ARM gcc installed, but not ARM clang):
foo(int, int):                          # @foo(int, int)
        cmp     edi, esi
        jle     .LBB0_2
        mov     dword ptr [rip + sb], esi
.LBB0_2:
        mov     dword ptr [rip + sa], edi
        ret
There are 3 basic blocks here: cmp/jle, the first mov, and the 2nd mov + ret. The 2nd block has no label, because it starts after the fall-through of a conditional branch.
The .LBB0_2 label name is auto-generated. The .L means it's a "local" label (no symbol in the symbol table of the object file; it's for internal use while assembling this file only). The BB stands for Basic Block. I think BB0_2 means it's basic block #2 (counting from 0) in the first function. (Duplicating the function with a different name gives us a .LBB1_2 label.) Within a function, different labels have a different last number.
Clang even labels all the basic blocks in comments. On Godbolt, click the // button to disable hiding comment lines. Then you get:
foo(int, int):                          # @foo(int, int)
# BB#0:
        #DEBUG_VALUE: foo:a <- %EDI
        #DEBUG_VALUE: foo:b <- %ESI
        cmp     edi, esi
        jle     .LBB0_2
# BB#1:
        #DEBUG_VALUE: foo:b <- %ESI
        #DEBUG_VALUE: foo:a <- %EDI
        mov     dword ptr [rip + sb], esi
.LBB0_2:
        #DEBUG_VALUE: foo:b <- %ESI
        #DEBUG_VALUE: foo:a <- %EDI
        mov     dword ptr [rip + sa], edi
        ret
i.e. basic blocks that aren't branch targets get a comment to delimit and number them, instead of a .L local label. The output also shows you which C variables are in which registers on entry to the BB.
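Given those two kinds of markers, the compiler's own basic blocks can be pulled out of the -S output mechanically. The sketch below is illustrative only. It assumes the clang 5.0 comment format shown above (`# BB#N:`), and it also accepts `# %bb.N:`, which I believe newer clang releases print instead. The function name `split_blocks` is hypothetical and it handles a single function.

```python
import re

# The clang output from above, with the #DEBUG_VALUE lines dropped.
ASM = """\
foo(int, int): # @foo(int, int)
# BB#0:
cmp edi, esi
jle .LBB0_2
# BB#1:
mov dword ptr [rip + sb], esi
.LBB0_2:
mov dword ptr [rip + sa], edi
ret
"""

LABEL_RE = re.compile(r"^\.LBB(\d+)_(\d+):")         # .LBB<function>_<block>:
COMMENT_RE = re.compile(r"^# (?:BB#|%bb\.)(\d+):")   # "# BB#N:" or (assumed) "# %bb.N:"

def split_blocks(asm):
    """Map basic block number -> list of instruction lines, for one function."""
    blocks, current = {}, None
    for raw in asm.splitlines():
        line = raw.strip()
        m = LABEL_RE.match(line) or COMMENT_RE.match(line)
        if m:
            # The block number is the last captured group in both patterns.
            current = int(m.group(m.lastindex))
            blocks[current] = []
        elif current is not None and line and not line.startswith("#"):
            blocks[current].append(line)   # comment lines are skipped
    return blocks

for number, insns in split_blocks(ASM).items():
    print(number, insns)
```

This recovers the three blocks discussed above: cmp/jle, the first mov, and the second mov + ret.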
Source: https://stackoverflow.com/questions/49612818/jump-in-the-middle-of-basic-block