Question
I need to optimise this code so that it performs the same basic function more efficiently. I think the filter loop is the main part that can be changed, as it seems to contain too many instructions, but I don't know where to start. I am working with the Cortex-M3 and Thumb-2.
I have tried modifying the filter loop so that it adds the previous number stored in the register and divides that by 8, but I don't know how to implement that.
; Perform in-place filtering of data supplied in memory
; the filter to be applied is a non-recursive filter of the form
; y[0] = x[-2]/8 + x[-1]/8 + x[0]/4 + x[1]/8 + x[2]/8
; set up the exception addresses
        THUMB
        AREA RESET, CODE, READONLY
        EXPORT __Vectors
        EXPORT Reset_Handler
__Vectors
        DCD 0x00180000 ; top of the stack
        DCD Reset_Handler ; reset vector - where the program starts
num_words EQU (end_source-source)/4 ; number of input values
filter_length EQU 5 ; number of filter taps (values)
        AREA 2a_Code, CODE, READONLY
Reset_Handler
        ENTRY
; set up the filter parameters
        LDR r0,=source ; point to the start of the area of memory holding inputs
        MOV r1,#num_words ; get the number of input values
        MOV r2,#filter_length ; get the number of filter taps
        LDR r3,=dest ; point to the start of the area of memory holding outputs
; find out how many times the filter needs to be applied
        SUBS r4,r1,r2 ; find the number of applications of the filter needed, less 1
        BMI exit ; give up if there is insufficient data for any filtering
; apply the filter
filter_loop
        LDMIA r0,{r5-r9} ; get the next 5 data values to be filtered
        ADD r5,r5,r9 ; sum x[-2] with x[2]
        ADD r6,r6,r8 ; sum x[-1] with x[1]
        ADD r9,r5,r6 ; sum x[-2]+x[2] with x[-1]+x[1]
        ADD r7,r7,r9,LSR #1 ; sum x[0] with (x[-2]+x[2]+x[-1]+x[1])/2
        MOV r7,r7,LSR #2 ; form (x[0] + (x[-2]+x[-1]+x[1]+x[2])/2)/4
        STR r7,[r3],#4 ; save calculated filtered value, move to next output data item
        ADD r0,r0,#4 ; move to start of next 5 input data values
        SUBS r4,r4,#1 ; move on to next set of 5 inputs
        BPL filter_loop ; continue until last set of 5 inputs reached
; execute an endless loop once done
exit
        B exit
        AREA 2a_ROData, DATA, READONLY
source ; some saw tooth data to filter - should blunt the sharp edges
        DCD 0,10,20,30,40,50,60,70,80,90,100,0,10,20,30,40,50,60,70,80,90,100
        DCD 0,10,20,30,40,50,60,70,80,90,100,0,10,20,30,40,50,60,70,80,90,100
        DCD 0,10,20,30,40,50,60,70,80,90,100,0,10,20,30,40,50,60,70,80,90,100
        DCD 0,10,20,30,40,50,60,70,80,90,100,0,10,20,30,40,50,60,70,80,90,100
end_source
        AREA 2a_RWData, DATA, READWRITE
dest ; copy to this area of memory
        SPACE end_source-source
end_dest
        END
I expect there is a more efficient way to run the code, whether that reduces the overall code size or speeds up execution, as long as it does the same thing. Any help would be appreciated.
Answer 1:
For code-size, try to only use registers r0..r7 which can be used in short 16-bit encodings.
Also, versions of instructions with flag-setting often have 16-bit encodings when the non-flag-setting version requires 32-bit, e.g.:
  adds r0, #4 is 16-bit vs. 32-bit add r0, #4
  movs r7,r7,LSR #2 is 16-bit vs. 32-bit MOV r7,r7,LSR #2
  movs r2,#filter_length is 16-bit vs. 32-bit MOV r2,#filter_length (non-tiny immediates like #88 still need a 32-bit Thumb2 mov)
  stmia r3!, {r5} (with write-back) is 16-bit vs. 32-bit str r7, [r3], #4 with post-increment
See the Thumb code-size section of my answer on your earlier question: How do I reduce execution time and number of cycles for a factorial loop? And/or code-size?. Look at the disassembly for your code and look for 32-bit instructions, and check why they're 32-bit, and look for a way to make them 16-bit. This is just super-basic Thumb optimization that you can always do.
r1 and r2 aren't even used inside your loop, and r4 = r1-r2 is an assemble-time constant that you're computing at runtime with 3 instructions... So that's obviously insane vs. movs r4, #num_words - filter_length.
If those are supposed to be inputs that aren't known at assemble time for your real code (maybe the same function is sometimes used on different inputs?), then reuse the registers that are "dead" after calculating a loop counter. It's kind of clunky that you accept pointers in r0 and r3, so you then have r2 and r4-r7 free if you use r1 for the loop counter, or r1-r2 and r5-r7 free if you use r4.
I chose to use r1 for the loop counter. This is disassembly from my version (arm-none-eabi-gcc -g -c -mthumb -mcpu=cortex-m3 arm-filter.S && arm-none-eabi-objdump -drwC arm-filter.o):
@@ Saving code size without any other changes
00000000 <function>:
0: 480a ldr r0, [pc, #40] ; (2c <exit+0x4>)
2: f05f 0158 movs.w r1, #88 ; 0x58
6: 2205 movs r2, #5
8: 4b09 ldr r3, [pc, #36] ; (30 <exit+0x8>)
a: 1a89 subs r1, r1, r2
c: d40c bmi.n 28 <exit>
0000000e <filter_loop>:
e: e890 00f4 ldmia.w r0, {r2, r4, r5, r6, r7}
12: 443a add r2, r7
14: 4434 add r4, r6
16: 4414 add r4, r2
18: eb15 0554 adds.w r5, r5, r4, lsr #1
1c: 08ad lsrs r5, r5, #2
1e: c320 stmia r3!, {r5}
20: 3004 adds r0, #4
22: 3901 subs r1, #1
24: d5f3 bpl.n e <filter_loop>
00000026 <exit>:
26: e7fe b.n 26 <exit>
Cortex-M3 doesn't have NEON, but there is data reuse between outputs. With unrolling, we can definitely reuse the load results, and some of the "inner" add results. Maybe with a sliding window to subtract the word that's no longer part of the total and add in the new one.
But with the middle element being "special", we have two 2-element windows on either side, unless we have enough spare bits at the top to add x[0] twice and then right shift by 3 without overflowing. Then you don't even need to unroll, just load 1 element / adjust sliding window and recalc the middle / store 1 element.
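To make the sliding-window idea concrete, here is a minimal C model of the arithmetic. It is a sketch for checking results, not the assembly itself. The function names filter_direct and filter_sliding are hypothetical, and the sliding version assumes the inputs are small enough that the 5-element total plus the middle element fits in 32 bits (true for the 0..100 sawtooth data in the question).

```c
#include <stddef.h>
#include <stdint.h>

/* Direct form, mirroring the arithmetic of the original loop:
 * (x[0] + (x[-2] + x[-1] + x[1] + x[2]) / 2) / 4, using unsigned shifts. */
static void filter_direct(const uint32_t *src, uint32_t *dst, size_t n)
{
    if (n < 5) {
        return; /* not enough data for a single output */
    }
    for (size_t i = 0; i + 5 <= n; i++) {
        uint32_t outer = src[i] + src[i + 4] + src[i + 1] + src[i + 3];
        dst[i] = (src[i + 2] + (outer >> 1)) >> 2;
    }
}

/* Sliding-window form: keep a running total of the 5 window elements,
 * add the middle element a second time, then shift right by 3.
 * Assumption: total + middle does not overflow 32 bits. */
static void filter_sliding(const uint32_t *src, uint32_t *dst, size_t n)
{
    if (n < 5) {
        return;
    }
    uint32_t total = src[0] + src[1] + src[2] + src[3] + src[4];
    for (size_t i = 0;; i++) {
        dst[i] = (total + src[i + 2]) >> 3;
        if (i + 5 >= n) {
            break; /* that was the last full window */
        }
        /* Slide: add the incoming word, drop the outgoing one. */
        total += src[i + 5] - src[i];
    }
}
```

Both functions write n - 4 outputs. As long as nothing overflows they agree, because (x0 + floor(s/2)) >> 2 equals (2*x0 + s) >> 3 for unsigned integers. Note that this C model still reads three elements per output; getting down to one load per output in assembly would mean keeping the window in registers, and I have not measured cycle counts for either form.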
(My first version of this answer was based on a misunderstanding of the code. I might update with a speed optimization later, but for now editing to remove wrong stuff.)
Source: https://stackoverflow.com/questions/55734308/how-do-i-optimise-a-filter-loop-for-cortex-m3