Jump in the middle of basic block

拥有回忆 提交于 2019-12-11 05:49:08

问题


A basic block is defined as a sequence of (non-jump) instructions ending with a jump (direct or indirect) instruction. The jump target address should be the start of another basic block. Consider I have the following assembly code :

106ac:       ba00000f        blt     106f0 <main+0xb8>
106b0:       e3099410        movw    r9, #37904      ; 0x9410
106b4:       e3409001        movt    r9, #1
106b8:       e79f9009        ldr     r9, [pc, r9]
106bc:       e3a06000        mov     r6, #0
106c0:       e1a0a008        mov     sl, r8
106c4:       e30993fc        movw    r9, #37884      ; 0x93fc
106c8:       e3409001        movt    r9, #1
106cc:       e79f9009        ldr     r9, [pc, r9]
106d0:       e5894000        str     r4, [r9]
106d4:       e7941105        ldr     r1, [r4, r5, lsl #2]
106d8:       e1a00007        mov     r0, r7
106dc:       e12fff31        blx     r1
106e0:       e0806006        add     r6, r0, r6
106e4:       e25aa001        subs    sl, sl, #1
106e8:       e287700d        add     r7, r7, #13
106ec:       1afffff4        bne     106c4 <main+0x8c>
106f0:       e30993d0        movw    r9, #37840      ; 0x93d0
106f4:       e3409001        movt    r9, #1

bb1

106a4:       ...
106ac:       ba00000f        blt     106f0 <main+0xb8>

The first basic block bb1 has a target address which is the start of bb4.

bb2

106b0:       e3099410        movw    r9, #37904      ; 0x9410
....        All other instructions
106c4:       e30993fc        movw    r9, #37884      ; 0x93fc
....        All other instructions
106d8:       e1a00007        mov     r0, r7
106dc:       e12fff31        blx     r1

The second basic block bb2 has an indirect branch so the target address is known only at runtime.

bb3

106e0:       e0806006        add     r6, r0, r6
106e4:       e25aa001        subs    sl, sl, #1
106e8:       e287700d        add     r7, r7, #13
106ec:       1afffff4        bne     106c4 <main+0x8c>

The third basic block has a target address which is not the start of another basic block but it is in the middle of bb2. According to the definition of a basic block, it is not possible. But, in practice, I am seeing this behavior (jumps in the middle of basic blocks) in multiple places. How to explain this behavior ? Is it possible to force a compiler (LLVM) to generate code that does not jump anywhere else except at the beginning of a basic block ?

bb4

106f0:       e30993d0        movw    r9, #37840      ; 0x93d0
106f4:       e3409001        movt    r9, #1
....
Ends with a branch (direct or indirect)

I am generating basic blocks using a tool (SPEDI) and the compiler used is LLVM (CLANG front end) and the targeted architecture is ARM V7 Cortex-A9.


回答1:


Basic blocks are the nodes in the control flow graph, which means that once control enters the block, it can't do anything else apart from running through the whole block and exiting it. It doesn't mean that they have to start or end with a jump instruction. For better understanding refer to this excerpt from Wikipedia:

Because of its construction procedure, in a CFG, every edge A→B has the property that:

outdegree(A) > 1 or indegree(B) > 1 (or both).

The CFG can thus be obtained, at least conceptually, by starting from the program's (full) flow graph—i.e. the graph in which every node represents an individual instruction—and performing an edge contraction for every edge that falsifies the predicate above, i.e. contracting every edge whose source has a single exit and whose destination has a single entry.

According to this definition I would analyze code between 106b0 and 106ec differently: one block B1 from 106b0 to 106c0, and one block B2 from 106c4 to 106ec. This code is a loop, B1 is the setup part of the loop and B2 is the body.

In ARM a bl instruction such as the one at 106dc is a function call, meaning that control will be passed to the called function but then returned to the instruction right after the bl. So if we're only constructing the CFG for the calling function I wouldn't consider this instruction as a block boundary. If we're doing the CFG for the whole program there should be an edge towards the called function here and then another edge back from the called function to the next instruction.




回答2:


A basic block doesn't contain branch targets, as Samuel's answer explains. The branch targets into blocks of instructions are also boundaries between basic blocks.

You're generating this code with a compiler, so use clang -O3 -S foo.c to get the compiler's asm output with labels on branch targets.

Compiling all the way to an object file and then disassembling that means you'd need a disassembler to put labels back onto the targets of all the branches it finds when disassembling. Agner Fog's x86 disassembler, objconv does this. Maybe there's something similar for ARM, but I don't think GNU binutils objdump -d has an option for that.

I don't have ARM clang installed, but the output is probably very similar to x86. For example, a very simple function that will compile with a branch:

int sa, sb;
void foo(int a, int b) {
    if (a>b) {
        sb = b;
    }
    sa = a;
}

Compiled for x86 on the Godbolt compiler explorer with clang5.0 -O3. (Godbolt has ARM-gcc installed, but not ARM-clang)

foo(int, int):                               # @foo(int, int)
        cmp     edi, esi
        jle     .LBB0_2
        mov     dword ptr [rip + sb], esi
.LBB0_2:
        mov     dword ptr [rip + sa], edi
        ret

There are 3 basic blocks here: cmp/jle, the first mov, and the 2nd mov+ret. The 2nd block has no label, because it starts after the fall-through of a conditional branch.

The .LBB0_2 label name is auto-generated. The .L means its a "local" label (no symbol in the symbol-table of the object file; it's for internal use while assembling this file only). The BB stands for Basic Block. I think BB0_2 means it's basic block #2 (counting from 0) in the first function. (Duplicating the function with a different name gives us a .LBB1_2 label.) Within a function, different labels have a different last number.


Clang even labels all the basic blocks in comments:

On Godbolt, click the // button to disable hiding comment lines. Then you get:

foo(int, int):                               # @foo(int, int)
# BB#0:
        #DEBUG_VALUE: foo:a <- %EDI
        #DEBUG_VALUE: foo:b <- %ESI
        cmp     edi, esi
        jle     .LBB0_2
# BB#1:
        #DEBUG_VALUE: foo:b <- %ESI
        #DEBUG_VALUE: foo:a <- %EDI
        mov     dword ptr [rip + sb], esi
.LBB0_2:
        #DEBUG_VALUE: foo:b <- %ESI
        #DEBUG_VALUE: foo:a <- %EDI
        mov     dword ptr [rip + sa], edi
        ret

i.e. basic blocks that aren't branch targets get a comment to delimit + number them, instead of a .L local label. It also shows you which C variables are in which registers on entry to the BB.



来源:https://stackoverflow.com/questions/49612818/jump-in-the-middle-of-basic-block

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!