Optimising this C (AVR) code

I have an interrupt handler that just isn't running fast enough for what I want to do. Basically I'm using it to generate sine waves by outputting a value from a look up table to a PORT on an AVR microncontroller but, unfortunately, this isn't happening fast enough for me to get the frequency of the wave that I want. I was told that I should look at implementing it in assembly as the compiler generated assembly might be slightly inefficient and may be able to be optimised but after looking at the assembly code I really can't see what I could do any better.

This is the C code:

const uint8_t amplitudes60[60] = {127, 140, 153, 166, 176, 191, 202, 212, 221, 230, 237, 243, 248, 251, 253, 254, 253, 251, 248, 243, 237, 230, 221, 212, 202, 191, 179, 166, 153, 140, 127, 114, 101, 88, 75, 63, 52, 42, 33, 24, 17, 11, 6, 3, 1, 0, 1, 3, 6, 11, 17, 24, 33, 42, 52, 63, 75, 88, 101, 114};
const uint8_t amplitudes13[13] = {127, 176,  221, 248,  202, 153, 101, 52, 17,  1, 6,  33,  75};
const uint8_t amplitudes10[10] = {127, 176,   248,  202, 101, 52, 17,  1,  33,  75};

volatile uint8_t numOfAmps = 60;
volatile uint8_t *amplitudes = amplitudes60;
volatile uint8_t amplitudePlace = 0; 

ISR(TIMER1_COMPA_vect) 
{
    PORTD = amplitudes[amplitudePlace];

    amplitudePlace++; 

    if(amplitudePlace == numOfAmps)
    {
        amplitudePlace = 0;
    }

}

amplitudes and numOfAmps are both changed by another interrupt routine that runs much slower than this one (it basically is run to change the frequencies that are being played). At the end of the day I won't be using those exact arrays but it will be a very similar set up. I'll most likely have an array with 60 values and another with just 30. This is because I'm building a frequency sweeper and at the lower frequencies I can afford to give it more samples as I have more clock cycles to play with but at the higher frequencies I'm very much strapped for time.

I do realise that I can get it to work with a lower sampling rate but I don't want to go under 30 samples per period. I don't think having the pointer to the array makes it any slower as the assembly to get a value from an array and the assembly to get a value from a pointer to an array seems the same (which makes sense).

At the highest frequency that I have to produce I've been told I should be able to get it working with about 30 samples per sine wave period. At the moment with 30 samples the fastest it will run is at about half the required max frequency which I think means that my interrupt needs to run twice as fast.

So that code there when simulated takes 65 cycles to complete. Again, I've been told I should be able to get it down to about 30 cycles at best.

This is the ASM code produced, with my thinking of what each line does next to it:

ISR(TIMER1_COMPA_vect) 
{
push    r1
push    r0
in      r0, 0x3f        ; save status reg
push    r0
eor     r1, r1      ; generates a 0 in r1, used much later
push    r24
push    r25
push    r30
push    r31         ; all regs saved


PORTD = amplitudes[amplitudePlace];
lds     r24, 0x00C8     ; r24 <- amplitudePlace I’m pretty sure
lds     r30, 0x00B4 ; these two lines load in the address of the 
lds     r31, 0x00B5 ; array which would explain why it’d a 16 bit number
                    ; if the atmega8 uses 16 bit addresses


add     r30, r24            ; aha, this must be getting the ADDRESS OF THE element 
adc     r31, r1             ; at amplitudePlace in the array.  

ld      r24, Z              ; Z low is r30, makes sense. I think this is loading
                            ; the memory located at the address in r30/r31 and
                            ; putting it into r24

out     0x12, r24           ; fairly sure this is putting the amplitude into PORTD

amplitudePlace++; 
lds     r24, 0x011C     ; r24 <- amplitudePlace
subi    r24, 0xFF       ; subi is subtract imediate.. 0xFF = 255 so I’m
                        ; thinking with an 8 bit value x, x+1 = x - 255;
                        ; I might just trust that the compiler knows what it’s 
                        ; doing here rather than try to change it to an ADDI 

sts     0x011C, r24     ; puts the new value back to the address of the
                        ; variable

if(amplitudePlace == numOfAmps)
lds     r25, 0x00C8 ; r24 <- amplitudePlace
lds     r24, 0x00B3 ; r25 <- numOfAmps 

cp      r24, r24        ; compares them 
brne    .+4             ; 0xdc <__vector_6+0x54>
        {
                amplitudePlace = 0;
                    sts     0x011C, r1 ; oh, this is why r1 was set to 0 earlier
        }


}

pop     r31             ; restores the registers
pop     r30
pop     r25
pop     r24
pop     r19
pop     r18
pop     r0
out     0x3f, r0        ; 63
pop     r0
pop     r1
reti

Apart from maybe using less registers in the interrupt so that I have less push/pops I really can't see where this assembly code is inefficient.

My only other thought is maybe the if statement could be gotten rid of if I could work out how to get a n bit int datatype in C so that the number will wrap around when it reaches the end? By this I mean I would have 2^n - 1 samples and then have the amplitudePlace variable just keep counting up so that when it reaches 2^n it'll overflow and will be reset to zero.

I did try simulating the code without the if bit completely and while it did improve the speed, it only took about 10 cycles off so that it was at about 55 cycles for one execution which still isn't quite fast enough unfortunately so I do need to optimise the code even further which is hard considering without that it's only 2 lines!!

My only other real thought is to see if I can store the static look up tables somewhere that takes less clock cycles to access? The LDS instructions it uses to access the array I think all take 2 cycles so I probably wouldn't really be saving much time there but at this stage I'm willing to try anything.

I'm totally at a loss of where to go from here. I can't see how I could make my C code any more efficient but I'm only fairly new to this sort of thing so I could be missing something. I would love any sort of help.. I realise this is a pretty particular and involved problem and normally I'd try to avoid asking those sort of questions here but I've been working on this for ages and am at a total loss so I'll really take any help that I can get.

ndim

I can see a few areas to start working on, listed in no particular order:

1. Reduce the number of registers to push, as each push/pop pair takes four cycles. For example, avr-gcc allows you to remove a few registers from its register allocator, so you can just use them for register variables in that single ISR and be sure they still contain the value from last time. You might also get rid of the pushing of r1 and eor r1,r1 if your program never sets r1 to anything but 0.

2. Use a local temporary variable for the new value of the array index to save unnecessary load and store instructions to that volatile variable. Something like this:

volatile uint8_t amplitudePlace;

ISR() {
    uint8_t place = amplitudePlace;
    [ do all your stuff with place to avoid memory access to amplitudePlace ]
    amplitudePlace = place;
}

3. Count backwards from 59 to 0 instead of from 0 to 59 to avoid the separate comparison instruction (comparison with 0 happens anyway in subtraction). Pseudo code:

     sub  rXX,1
     goto Foo if non-zero
     movi rXX, 59
Foo:

instead of

     add  rXX,1
     compare rXX with 60
     goto Foo if >=
     movi rXX, 0
Foo:

4. Perhaps use pointers and pointer comparisons (with precalculated values!) instead of array indexes. It needs to be checked versus counting backwards which one is more efficient. Maybe align the arrays to 256 byte boundaries and use only 8-bit registers for the pointers to save on loading and saving the higher 8 bits of the addresses. (If you are running out of SRAM, you can still fit the content of 4 of those 60 byte arrays into one 256 byte array and still get the advantage of all addresses consisting of 8 constant high bits and the 8 variable lower bits.)

uint8_t array[60];
uint8_t *idx = array; /* shortcut for &array[0] */
const uint8_t *max_idx = &array[59];

ISR() {
    PORTFOO = *idx;
    ++idx;
    if (idx > max_idx) {
        idx = array;
    }
}

The problem is that pointers are 16 bit whereas your simple array index formerly was 8 bit in size. Helping with that might be a trick if you design your array addresses such that the higher 8 bits of the address are constants (in assembly code, hi8(array)), and you only deal with the lower 8 bits that actually change in the ISR. That does mean writing assembly code, though. The generated assembly code from above might be a good starting point for writing that version of the ISR in assembly.

5. If feasible from a timing point of view, adjust the sample buffer size to a power of 2 to replace the if-reset-to-zero part with a simple i = (i+1) & ((1 << POWER)-1);. If you want to go with the 8-bit/8-bit address split proposed in 4., perhaps even going to 256 for the power of two (and duplicating sample data as necessary to fill the 256 byte buffer) will even save you the AND instruction after the ADD.

6. In case the ISR only uses instructions which do not affect the status register, stop push and popping SREG.

General

The following might come in handy especially for manually checking all the other assembly code for assumptions:

firmware-%.lss: firmware-%.elf
        $(OBJDUMP) -h -S $< > $@

This generates a commented complete assembly language listing of the whole firmware image. You can use that to verify register (non-)usage. Note that startup code only run once long before you first enable interrupts will not interfere with your ISR's later exclusive use of registers.

If you decide to not write that ISR in assembly code directly, I would recommend you write the C code and check the generated assembly code after every compilation, in order to immediately observe what your changes end up generating.

You might end up writing a dozen or so variants of the ISR in C and assembly, adding up the cycles for each variant, and then chosing the best one.

Note Without doing any register reservation, I end up with something around 31 cycles for the ISR (excluding entering and leaving, which adds another 8 or 10 cycles). Completely getting rid of the register pushing would get the ISR down to 15 cycles. Changing to a sample buffer with a constant size of 256 bytes and giving the ISR exclusive use of four registers allows getting down to 6 cycles being spent in the ISR (plus 8 or 10 to enter/leave).

I'd say the best thing would be to write your ISR in pure assembler. It's very short and simple code, and you have the existing disassembler to guide you. But for something of this nature, you ought to be able to do better: e.g. use fewer registers, to save on push and pop; re-factor it so that it's not loading amplitudePlace from memory three separate times, etc.

Must you share all those variables with the rest of the program? Since every such variable you share must be volatile, the compiler isn't allowed optimize it. At least amplitudePlace looks like it could be changed to a local static variable, and then the compiler may be able to optimize it further.

To clarify, your interrupt should be this:

ISR(TIMER1_COMPA_vect) 
{
    PORTD = amplitudes[amplitudePlace++];
    amplitudePlace &= 63;
}

This will require your table to be 64 entries long. If you can choose the address of your table, you can get away with a single pointer, increment it, & it with 0xffBf.

If using variables instead of fixed constant is slowing things down, you can replace the pointer variable with a specific array:

PORTD = amplitudes13[amplitudePlace++];

Then you change the interrupt pointer to use a different function for each waveform. This is not likely to be a big savings, but we're getting down to 10's of cycles total.

As for the register usage thing. Once you get a really simple ISR like this, you can check the prolog and epilog of the ISR which push and pop the processor state. If your ISR only uses 1 register, you can do it in assembler and only save and restore that one register. This will reduce the interrupt overhead without affecting the rest of the program. Some compilers might do this for you, but I doubt it.

If there is time and space you can also create a long table and replace the ++ with +=freq where freq will cause the waveform to be an integer multiple of the base frequency (2x,3x,4x etc...) by skipping that many samples.

Instead of stepping through the table one entry at a time with varying interrupt rates, have you considered turning the problem around and stepping at a variable rate with a fixed interrupt frequency? That way the ISR itself would be heavier but you may afford to run it at a lower rate. Plus, with a little fixed-point arithmetic you can easily generate a wider spectrum of frequencies without messing around with multiple tables.

Anyway, there are a hundred and one ways of cheating to save cycles for this type of problem, if you can afford to bend your requirements a little to suite the hardware. For instance you could chain your timer's output to clock another hardware timer, and use the second timer's counter as your table index. You might reserve global registers or abuse unused I/Os to store variables. You can look up two entries at a time (or interpolate) in your COMPA interrupt and set up a tiny second COMPB interrupt in between to emit the buffered entry. And so on, and so forth.

With a little hardware abuse and carefully crafted assembly code you should be able to do this in 15 cycles or so without too much trouble. Whether you can make it play nice with the rest of the system is another question.

Maybe it suffices to get rid of the conditional and the comparison all together by using an arithmetic expression:

ISR(TIMER1_COMPA_vect) 
{
        PORTD = amplitudes[amplitudePlace];

        amplitudePlace = (amplitudePlace + 1) % numOfAmps;
}

If your CPU executes the modulo operation with reasonable speed, this should be much faster. If it still doesn't suffice, try writing this version in assembler.

来源：https://stackoverflow.com/questions/5646040/optimising-this-c-avr-code

标签

optimization

assembly

avr