Question
I have a situation where some of the address space is sensitive: if you read it, you crash, because there is nobody there to respond to that address.
pop {r3,pc}
bx r0
0: e8bd8008 pop {r3, pc}
4: e12fff10 bx r0
8: bd08 pop {r3, pc}
a: 4700 bx r0
The bx was not created by the compiler as an instruction; instead it is the result of a 32-bit constant that didn't fit as an immediate in a single instruction, so a pc-relative load is set up. This is basically the literal pool, and it happens to have bits that resemble a bx.
It is easy to write a test program to generate the issue.
unsigned int more_fun ( unsigned int );
unsigned int fun ( void )
{
return(more_fun(0x12344700)+1);
}
00000000 <fun>:
0: b510 push {r4, lr}
2: 4802 ldr r0, [pc, #8] ; (c <fun+0xc>)
4: f7ff fffe bl 0 <more_fun>
8: 3001 adds r0, #1
a: bd10 pop {r4, pc}
c: 12344700 eorsne r4, r4, #0, 14
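The coincidence can be checked mechanically. Here is a minimal Python sketch (not from the original post); the mask and pattern values are the Thumb BX Rm encoding, 0100 0111 0 Rm(4) 000, so the low halfword of the pool constant above decodes as bx r0:

```python
# Thumb BX Rm is encoded as 0100 0111 0 Rm(4) 000, i.e. 0x4700 | (Rm << 3).
def thumb_bx_reg(halfword):
    """Return the Rm register number if halfword encodes BX Rm, else None."""
    if halfword & 0xFF87 == 0x4700:
        return (halfword >> 3) & 0xF
    return None

pool_word = 0x12344700            # the 32-bit constant placed in the literal pool
low = pool_word & 0xFFFF          # the halfword a Thumb decoder would see first
print(thumb_bx_reg(low))          # -> 0, i.e. the pool data looks like bx r0
```

By contrast, the real return instruction 0xbd10 (pop {r4, pc}) does not match the BX pattern, which is exactly the asymmetry the question is about.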
What appears to be happening is that the processor, while waiting on data coming back from the pop (ldm), moves on to the next instruction, bx r0 in this case, and starts a prefetch at the address in r0, which hangs the ARM.
As humans we see the pop as an unconditional branch, but the processor does not; it keeps going through the pipe.
Prefetching and branch prediction are nothing new (we have the branch predictor off in this case); they are decades old and not limited to ARM. But the number of instruction sets that have the PC as a GPR, with instructions that to some extent treat it as non-special, is small.
I am looking for a gcc command-line option to prevent this. I can't imagine we are the first ones to see this.
I can of course do this
-march=armv4t
00000000 <fun>:
0: b510 push {r4, lr}
2: 4803 ldr r0, [pc, #12] ; (10 <fun+0x10>)
4: f7ff fffe bl 0 <more_fun>
8: 3001 adds r0, #1
a: bc10 pop {r4}
c: bc02 pop {r1}
e: 4708 bx r1
10: 12344700 eorsne r4, r4, #0, 14
which prevents the problem.
Note this is not limited to Thumb mode; gcc can produce ARM code as well for something like this, with the literal pool after the pop.
unsigned int more_fun ( unsigned int );
unsigned int fun ( void )
{
return(more_fun(0xe12fff10)+1);
}
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e59f0008 ldr r0, [pc, #8] ; 14 <fun+0x14>
8: ebfffffe bl 0 <more_fun>
c: e2800001 add r0, r0, #1
10: e8bd8010 pop {r4, pc}
14: e12fff10 bx r0
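The same check works for the ARM-mode pool word. A small Python sketch (again not from the original post); the mask and pattern come from the A32 BX Rm encoding, cond(4) 0001 0010 1111 1111 1111 0001 Rm(4):

```python
# A32 BX Rm: cond(4) 0001 0010 1111 1111 1111 0001 Rm(4),
# so (word & 0x0FFFFFF0) == 0x012FFF10 for any condition code.
def arm_bx_reg(word):
    """Return (cond, Rm) if the 32-bit word encodes BX Rm, else None."""
    if word & 0x0FFFFFF0 == 0x012FFF10:
        return (word >> 28) & 0xF, word & 0xF
    return None

pool_word = 0xE12FFF10            # the constant the program loads via the pool
print(arm_bx_reg(pool_word))      # -> (14, 0): condition AL, bx r0
```

The actual return instruction 0xe8bd8010 (pop {r4, pc}) does not match this pattern, which is why the disassembler, and a speculating fetch stage, see a bx where the program only has data.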
Hoping someone knows a generic or ARM-specific option that does an armv4t-like return (pop {r4,lr}; bx lr in ARM mode, for example) without the baggage, or that puts a branch-to-self immediately after a pop pc (that seems to solve the problem; the pipe is not confused about b being an unconditional branch).
EDIT
ldr pc,[something]
bx rn
also causes a prefetch, which is not going to fall under -march=armv4t. gcc intentionally generates ldrls pc,[]; b somewhere for switch statements, and that is fine. I didn't inspect the backend to see if there are other ldr pc,[] instructions generated.
EDIT
Looks like ARM did report this as an erratum; wish I had known that before we spent a month on it...
Answer 1:
https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html has a -mpure-code option, which doesn't put constants in code sections. "This option is only available when generating non-pic code for M-profile targets with the MOVT instruction." So it probably loads constants with a pair of mov-immediate instructions instead of from a constant pool.
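If that guess is right, the movw/movt pair just materializes the low and high 16 bits of the constant separately, so there is no pool word in the code section to mis-fetch. A Python sketch of the split (an illustration, not compiler output):

```python
def movw_movt_split(constant):
    """Split a 32-bit constant the way a movw/movt pair would load it:
    movw sets the low 16 bits, movt sets the high 16 bits."""
    return constant & 0xFFFF, (constant >> 16) & 0xFFFF

lo, hi = movw_movt_split(0x12344700)
print(hex(lo), hex(hi))   # movw r0, #0x4700 ; movt r0, #0x1234
```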
This doesn't fully solve your problem though, since speculative execution of regular instructions (after a conditional branch inside a function) with bogus register contents could still trigger access to unpredictable addresses. Or just the first instruction of another function might be a load, so falling through into another function isn't always safe either.
I can try to shed some light on why this is obscure enough that compilers don't already avoid it.
Normally, speculative execution of instructions that fault is not a problem. The CPU doesn't actually take the fault until it becomes non-speculative. Incorrect (or non-existent) branch prediction can make the CPU do something slow before figuring out the right path, but there should never be a correctness problem.
Normally, speculative loads from memory are allowed in most CPU designs. But memory regions with MMIO registers obviously have to be protected from this. In x86, for example, memory regions can be WB (normal, write-back cacheable, speculative loads allowed) or UC (uncacheable, no speculative loads). Not to mention write-combining, write-through, ...
You probably need something similar to solve your correctness problem, to stop speculative execution from doing something that will actually explode. This includes speculative instruction-fetch triggered by a speculative bx r0. (Sorry, I don't know ARM, so I can't suggest how you'd do that. But this is why it's only a minor performance problem for most systems, even though they have MMIO registers that can't be speculatively read.)
I think it's very unusual to have a setup that lets the CPU do speculative loads from addresses that crash the system instead of just raising an exception when / if they become non-speculative.
"we have the branch predictor off in this case"
This may be why you're always seeing speculative execution beyond an unconditional branch (the pop), instead of just very rarely.
Nice detective work with using a bx to return, showing that your CPU detects that kind of unconditional branch at decode, but doesn't check the pc bit in a pop. :/
In general, branch prediction has to happen before decode, to avoid fetch bubbles. Given the address of a fetch block, predict the next block-fetch address. Predictions are also generated at the instruction level instead of fetch-block level, for use by later stages of the core (because there can be multiple branch instructions in a block, and you need to know which one is taken).
That's the generic theory. Branch prediction isn't 100%, so you can't count on it to solve your correctness problem.
x86 CPUs can have performance problems where the default prediction for an indirect jmp [mem] or jmp reg is the next instruction. If speculative execution starts something that's slow to cancel (like div on some CPUs) or triggers a slow speculative memory access or TLB miss, it can delay execution of the correct path once it's determined.
So it's recommended (by optimization manuals) to put ud2 (illegal instruction) or int3 (debug trap) or similar after a jmp reg. Or better, put one of the jump-table destinations there so "fall-through" is a correct prediction some of the time. (If the BTB doesn't have a prediction, next-instruction is about the only sane thing it can do.)
x86 doesn't normally mix code with data, though, so this is more likely to be a problem for architectures where literal pools are common. (But loads from bogus addresses can still happen speculatively after indirect branches, or mispredicted normal branches. For example, if(address_good) { call table[address](); } could easily mispredict and trigger speculative code-fetch from a bad address. But if the eventual physical address range is marked uncacheable, the load request would stop in the memory controller until it was known to be non-speculative.)
A return instruction is a type of indirect branch, but it's less likely that a next-instruction prediction is useful. So maybe bx lr stalls because speculative fall-through is less likely to be useful?
pop {pc} (aka LDMIA from the stack pointer) is either not detected as a branch in the decode stage (if it doesn't specifically check the pc bit), or it's treated as a generic indirect branch. There are certainly other use-cases for a load into pc as a non-return branch, so detecting it as a probable return would require checking the source register encoding as well as the pc bit.
Maybe there's a special (internal hidden) return-address predictor stack that helps get bx lr predicted correctly every time, when paired with bl? x86 does this, to predict call/ret instructions.
Have you tested whether pop {r4, pc} is more efficient than pop {r4, lr} / bx lr? If bx lr is handled specially in more than just avoiding speculative execution of garbage, it might be better to get gcc to do that, instead of having it lead its literal pool with a b instruction or something.
Source: https://stackoverflow.com/questions/46118893/arm-prefetch-workaround