Question
I have a C function that copies a framebuffer to FSMC RAM.
The function drags the frame rate of the game loop down to 10 FPS. I would like to know how to analyze the disassembled function: should I count the cycles of each instruction? I want to know in which part the CPU spends its time. I'm sure the algorithm is also a problem, because it's O(N^2).
The C function is:
void LCD_Flip()
{
    u8 i,j;
    LCD_SetCursor(0x00, 0x0000);
    LCD_WriteRegister(0x0050,0x00);//GRAM horizontal start position
    LCD_WriteRegister(0x0051,239);//GRAM horizontal end position
    LCD_WriteRegister(0x0052,0);//Vertical GRAM Start position
    LCD_WriteRegister(0x0053,319);//Vertical GRAM end position
    LCD_WriteIndex(0x0022);
    for(j=0;j<fbHeight;j++)
    {
        for(i=0;i<240;i++)
        {
            u16 color = frameBuffer[i+j*fbWidth];
            LCD_WriteData(color);
        }
    }
}
Disassembled function:
08000fd0 <LCD_Flip>:
8000fd0: b580 push {r7, lr}
8000fd2: b082 sub sp, #8
8000fd4: af00 add r7, sp, #0
8000fd6: 2000 movs r0, #0
8000fd8: 2100 movs r1, #0
8000fda: f7ff fde9 bl 8000bb0 <LCD_SetCursor>
8000fde: 2050 movs r0, #80 ; 0x50
8000fe0: 2100 movs r1, #0
8000fe2: f7ff feb5 bl 8000d50 <LCD_WriteRegister>
8000fe6: 2051 movs r0, #81 ; 0x51
8000fe8: 21ef movs r1, #239 ; 0xef
8000fea: f7ff feb1 bl 8000d50 <LCD_WriteRegister>
8000fee: 2052 movs r0, #82 ; 0x52
8000ff0: 2100 movs r1, #0
8000ff2: f7ff fead bl 8000d50 <LCD_WriteRegister>
8000ff6: 2053 movs r0, #83 ; 0x53
8000ff8: f240 113f movw r1, #319 ; 0x13f
8000ffc: f7ff fea8 bl 8000d50 <LCD_WriteRegister>
8001000: 2022 movs r0, #34 ; 0x22
8001002: f7ff fe87 bl 8000d14 <LCD_WriteIndex>
8001006: 2300 movs r3, #0
8001008: 71bb strb r3, [r7, #6]
800100a: e01b b.n 8001044 <LCD_Flip+0x74>
800100c: 2300 movs r3, #0
800100e: 71fb strb r3, [r7, #7]
8001010: e012 b.n 8001038 <LCD_Flip+0x68>
8001012: 79f9 ldrb r1, [r7, #7]
8001014: 79ba ldrb r2, [r7, #6]
8001016: 4613 mov r3, r2
8001018: 011b lsls r3, r3, #4
800101a: 1a9b subs r3, r3, r2
800101c: 011b lsls r3, r3, #4
800101e: 1a9b subs r3, r3, r2
8001020: 18ca adds r2, r1, r3
8001022: 4b0b ldr r3, [pc, #44] ; (8001050 <LCD_Flip+0x80>)
8001024: f833 3012 ldrh.w r3, [r3, r2, lsl #1]
8001028: 80bb strh r3, [r7, #4]
800102a: 88bb ldrh r3, [r7, #4]
800102c: 4618 mov r0, r3
800102e: f7ff fe7f bl 8000d30 <LCD_WriteData>
8001032: 79fb ldrb r3, [r7, #7]
8001034: 3301 adds r3, #1
8001036: 71fb strb r3, [r7, #7]
8001038: 79fb ldrb r3, [r7, #7]
800103a: 2bef cmp r3, #239 ; 0xef
800103c: d9e9 bls.n 8001012 <LCD_Flip+0x42>
800103e: 79bb ldrb r3, [r7, #6]
8001040: 3301 adds r3, #1
8001042: 71bb strb r3, [r7, #6]
8001044: 79bb ldrb r3, [r7, #6]
8001046: 2b63 cmp r3, #99 ; 0x63
8001048: d9e0 bls.n 800100c <LCD_Flip+0x3c>
800104a: 3708 adds r7, #8
800104c: 46bd mov sp, r7
800104e: bd80 pop {r7, pc}
Answer 1:
This does not exactly answer your question, but I see you are aiming for fast execution of the loops.
Here are some tips from the book 'ARM System Developer's Guide: Designing and Optimizing System Software (The Morgan Kaufmann Series in Computer Architecture and Design)' http://www.amazon.com/ARM-System-Developers-Guide-Architecture/dp/1558608745
Chapter 5 contains a section named 'C looping structures'. Here is the summary of that section:
Writing Loops Efficiently
- Use loops that count down to zero. Then the compiler does not need to allocate a register to hold the termination value, and the comparison with zero is free.
- Use unsigned loop counters by default and the continuation condition i!=0 rather than i>0. This will ensure that the loop overhead is only two instructions.
- Use do-while loops rather than for loops when you know the loop will iterate at least once. This saves the compiler checking to see if the loop count is zero.
- Unroll important loops to reduce the loop overhead. Do not overunroll. If the loop overhead is small as a proportion of the total, then unrolling will increase code size and hurt the performance of the cache.
- Try to arrange that the number of elements in arrays are multiples of four or eight. You can then unroll loops easily by two, four, or eight times without worrying about the leftover array elements.
Based on the summary, your inner loop might look like this (note that it sends each row in reverse order):
unsigned int i = 240;  // Use unsigned loop counters by default
                       // and the continuation condition i!=0
do
{
    // Unroll important loops to reduce the loop overhead.
    // Pre-decrement so the indices run from 239 down to 0.
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
}
while ( i != 0 );      // Use do-while loops rather than for
                       // loops when you know the loop will
                       // iterate at least once
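A variant that keeps the count-down counter and the four-way unrolling but still sends the pixels in forward order is sketched below. It is not taken from the book. `flip_row`, `lcd_write_data_stub`, `sent` and `sent_count` are hypothetical names, and the stub records pixels into an array so the loop can be checked on a PC; on the target it would be replaced by the real LCD_WriteData.

```c
#include <assert.h>
#include <stdint.h>

#define ROW_PIXELS 240u   /* visible pixels per row, as in the question */

/* Hypothetical stand-in for LCD_WriteData: records pixels instead of
   writing them to the FSMC, so the loop can be checked off-target. */
static uint16_t sent[ROW_PIXELS];
static unsigned sent_count;

static void lcd_write_data_stub(uint16_t color)
{
    if (sent_count < ROW_PIXELS) {
        sent[sent_count++] = color;
    }
}

/* Send one row in forward order. The counter runs down to zero while a
   pointer walks up through the row, so no index arithmetic is needed. */
static void flip_row(const uint16_t *row)
{
    unsigned n = ROW_PIXELS / 4u;    /* 60 passes, four pixels each */

    assert(ROW_PIXELS % 4u == 0u);   /* unrolling by four needs a multiple of 4 */
    do {
        lcd_write_data_stub(*row++);
        lcd_write_data_stub(*row++);
        lcd_write_data_stub(*row++);
        lcd_write_data_stub(*row++);
    } while (--n != 0u);
}
```

The outer loop would call `flip_row(&frameBuffer[j * fbWidth])` once per line, so the multiplication happens once per row instead of once per pixel.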
You might also want to experiment with pragmas, e.g.:
#pragma Otime
http://www.keil.com/support/man/docs/armcc/armcc_chr1359124989673.htm
#pragma unroll(n)
http://www.keil.com/support/man/docs/armcc/armcc_chr1359124992247.htm
Since this is a Cortex-M3, also check whether the MCU hardware lets you arrange code and data to take advantage of its Harvard architecture (I saw a 30% speed increase).
See my other answer on that topic.
Not everything here may apply to your application (for example, filling a buffer in reverse order). I just wanted to draw your attention to the book and to possible points for optimization.
Answer 2:
You should start by compiling the C code with speed optimizations enabled. The disassembled code you provide keeps the i and j counters (and color) on the stack, which adds several load/store operations to every pass through the inner loop. You might also want to inline LCD_WriteData in the inner loop.
On the other hand, if you are really writing to the LCD in the inner loop then the performance may be limited by that interface.
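To judge whether the loop or the LCD interface dominates, a rough cycle budget can help. The sketch below is only arithmetic. The 24000 pixels come from the loop bounds in the listing (`cmp r3, #239` and `cmp r3, #99`, i.e. 240 x 100), the 10 FPS from the question, and the 72 MHz core clock is an assumption that must be replaced with the real value. `cycles_per_pixel` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on the cycles available per pixel if LCD_Flip alone
   consumed the whole frame time. */
static uint32_t cycles_per_pixel(uint32_t cpu_hz, uint32_t fps, uint32_t pixels)
{
    assert(fps != 0u && pixels != 0u);
    return cpu_hz / fps / pixels;
}
```

With the assumed numbers, `cycles_per_pixel(72000000u, 10u, 24000u)` gives 300 cycles per pixel, while the inner loop in the listing is about 20 instructions per pixel before LCD_WriteData is entered. If the flip really accounts for most of the frame time, that gap suggests the time is spent inside LCD_WriteData or in FSMC wait states rather than in the loop bookkeeping; timing the function on the target would confirm it.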
Answer 3:
Purely to reduce the number of looped operations, you could do something like the following. I made some assumptions which may not be accurate: you had a loop that went from i=0:239, and I am assuming that fbWidth is the same as 240. If this isn't true then the loop would have to be more complicated.
void LCD_Flip()
{
    // We will use a precalculated limit and one single loop.
    // u32 so that the pixel count cannot overflow 16 bits.
    u32 i,limit = (u32)fbHeight*fbWidth;
    LCD_SetCursor(0x00, 0x0000);
    LCD_WriteRegister(0x0050,0x00);//GRAM horizontal start position
    LCD_WriteRegister(0x0051,239);//GRAM horizontal end position
    LCD_WriteRegister(0x0052,0);//Vertical GRAM Start position
    LCD_WriteRegister(0x0053,319);//Vertical GRAM end position
    LCD_WriteIndex(0x0022);
    // Single loop from 0:limit-1 takes care of having to do an
    // x,y conversion each iteration.
    for(i=0;i<limit;i++)
    {
        u16 color = frameBuffer[i];
        LCD_WriteData(color);
    }
}
This strips out the two loops in favor of a single for loop with only one conditional test per iteration. On top of that, the indexing into frameBuffer is now linear, so we don't need to multiply out the width to go from x,y to linear storage. Your loop iterations won't have been reduced (i.e. it is still O(N) with N = height*width), but the number of instructions should have been reduced.
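If fbWidth is not 240, one way to keep the per-pixel work linear is to walk each row with a pointer and skip the unused tail of the row once per line. This is a sketch under that assumption, not a drop-in replacement. `flip_rows`, `record_pixel`, `out` and `out_count` are hypothetical names, and `record_pixel` stands in for LCD_WriteData so the code can run on a PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OUT_MAX 64u

/* Hypothetical stand-in for LCD_WriteData: records what would be sent. */
static uint16_t out[OUT_MAX];
static size_t out_count;

static void record_pixel(uint16_t color)
{
    if (out_count < OUT_MAX) {
        out[out_count++] = color;
    }
}

/* Send `visible` pixels from each of `rows` lines of a framebuffer whose
   lines are `stride` pixels apart. The only per-pixel work is one load,
   one pointer increment and one compare. */
static void flip_rows(const uint16_t *fb, size_t stride,
                      size_t visible, size_t rows)
{
    assert(visible <= stride);
    for (size_t y = 0; y < rows; y++) {
        const uint16_t *p = fb;
        const uint16_t *end = fb + visible;
        while (p != end) {
            record_pixel(*p++);
        }
        fb += stride;   /* jump to the start of the next line */
    }
}
```

For the question's case the call would look like `flip_rows(frameBuffer, fbWidth, 240, fbHeight)`, provided fbWidth is at least 240.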
As @Joe Hass noted in his answer, this may not actually help at all if you are really limited by the LCD interface. Depending on which STM32 you're using, the FSMC may not be particularly fast, and I can't imagine the LCD controller would be very fast either.
Source: https://stackoverflow.com/questions/23253358/optimizing-arm-cortex-m3-code