问题
[A lot of text incoming, since I want to detail my question as best as I can.]
I'm in the process of optimizing hand-written ARM assembly code for a Cortex-M0. The board I'm using is the STMicro STM32F0Discovery, which has an STM32F051R8 controller. The controller is running at 48 MHz.
Unfortunately, I'm getting some pretty strange cycle counts when doing optimizations.
For example, adding a single nop
into a loop in my code should add 2 cycles in total (looped 2 times). However, doing so adds around 1800 extra cycles. Now, when I add in an extra nop
(so 2 nop
s in total), the cycle count does increase by the expected 4 cycles.
I get similar strange results for the example piece of code below. The example code shows, for the top excerpt: c = 25 * a + 5 * b
. The bottom excerpt is c = 5 * (5 * a + b)
. So, the bottom one should be faster, since it requires 1 less mov
. However, changing this:
movs r4, #25
muls r3, r4, r3
add r2, r3
ldrb r3, [r6, #RoundStep]
movs r4, #5
muls r3, r4, r3
add r2, r3
into this:
movs r4, #5
muls r3, r4, r3
ldrb r5, [r6, #RoundStep]
add r3, r5
muls r3, r4, r3
add r2, r3
does not increase the speed by the expected 1 cycle, instead, it decreases the speed by more or less 1000 cycles...
To count the cycles, I'm using the SysTick counter, counting down from its max value, and increasing an overflow counter on overflow interrupt. The code that I'm using for this is more or less the same as this excerpt from the ARM website, but rewritten for the Cortex-M0 that I'm using. My code is sufficiently fast that an overflow interrupt never happens during measurements.
Now, I was starting to think that the counter was giving me wrong values, so I also wrote some code for a TI Stellaris LaunchPad I had lying around. This is a Cortex-M4F running at 80 MHz. The code measures the number of cycles a certain pin is held high. Of course, the clock of the M0 and that of the M4F aren't running in sync, so the reported cycle counts vary a bit, which I "fix" by taking a very low weighted exponential average of the measured cycle counts (avg = 0.995 * avg + 0.005 * curCycles
) and repeating the measurement 10000 times.
The time measured by the M4F is the same as measured by the M0, so "unfortunately" it seems the SysTick counter is working just fine in the M0.
At first I thought these extra delays were caused by pipeline stalls, but on one hand the M0 seems to be too simple for that, and on the other I can't find any detailed info on the M0's pipeline, so I can't verify.
So, my question is: what is going on here? Why does adding a single nop
make my function take an extra 1000 cycles/loop, but do two nop
s only increase the cycle count by 2? How come removing instructions makes my code execute slower?
回答1:
The mul
instruction can be multiple cycles of the ALU pipe. Your transformation of c = 25 * a + 5 * b
into c = 5 * (5 * a + b)
requires one less mov
. However, the load-store stage of the pipeline over-lays with the ALU. These are often separate stages and with a ldrb
instruction you can get mov
instructions for free. Also, depending on the values, the muls
may execute faster; specifically, the top bytes being zero often result in a sorter multiply cycle. There are far less data dependencies in the first version; instruction n has no registers in common with n+1. This is a basic requirement to allow pipe-lining.
Compare,
ldrb r5, [r6, #RoundStep] ; 2 cycles
add r3, r5 ; must block for r5 to load (1 cycle)
with,
ldrb r3, [r6, #RoundStep] ; 2 cycles
movs r4, #5 ; may run in parallel with above.
So even though you may add up the instruction count and have less code, it can turn out that a larger alternate will run faster due to pipe-lining or instruction scheduling.
The 2nd version may be faster if you can relocate the ldrb
towards the beginning of the routine.
来源:https://stackoverflow.com/questions/18960524/erratic-cycle-counts-on-arm-cortex-m0