Faster 16bit multiplication algorithm for 8-bit MCU

没有蜡笔的小新 2021-02-12 11:27

I'm searching for an algorithm to multiply two integer numbers that is better than the one below. Do you have a good idea about that? (The MCU - AT Tiny 84/85 or similar - wher

6 Answers
  •  眼角桃花
    2021-02-12 11:49

    One approach is to unroll the loop. I don't have a compiler for the platform you're using so I can't look at the generated code, but an approach like this could help.

    The performance of this code is less data-dependent -- you go faster in the worst case by not checking to see if you're in the best case. Code size is a bit bigger but not the size of a lookup table.

    (Note code untested, off the top of my head. I'm curious about what the generated code looks like!)

    #define UMUL16_STEP(a, b, shift) \
        if ((b) & (1U << (shift))) result += ((a) << (shift));
    
    uint16_t umul16(uint16_t a, uint16_t b)
    {
        uint16_t result = 0;
    
        UMUL16_STEP(a, b, 0);
        UMUL16_STEP(a, b, 1);
        UMUL16_STEP(a, b, 2);
        UMUL16_STEP(a, b, 3);
        UMUL16_STEP(a, b, 4);
        UMUL16_STEP(a, b, 5);
        UMUL16_STEP(a, b, 6);
        UMUL16_STEP(a, b, 7);
        UMUL16_STEP(a, b, 8);
        UMUL16_STEP(a, b, 9);
        UMUL16_STEP(a, b, 10);
        UMUL16_STEP(a, b, 11);
        UMUL16_STEP(a, b, 12);
        UMUL16_STEP(a, b, 13);
        UMUL16_STEP(a, b, 14);
        UMUL16_STEP(a, b, 15);
    
        return result;
    }
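
Since the code above is untested, one way to gain some confidence is to compare it against the built-in `*` operator on a desktop compiler. The sketch below is only an assumption about how such a check could look: `umul16_selfcheck` is a hypothetical helper name, and the `(uint16_t)` cast in the macro is an addition so that the arithmetic stays in range on hosts where `int` is wider than 16 bits (on AVR, where `int` is 16 bits, it changes nothing).

```c
#include <assert.h>
#include <stdint.h>

/* One partial product: add (a << shift) when bit `shift` of b is set.
 * The cast is an addition for host testing, see the note above. */
#define UMUL16_STEP(a, b, shift) \
    if ((b) & (1U << (shift))) result += (uint16_t)((a) << (shift));

static uint16_t umul16(uint16_t a, uint16_t b)
{
    uint16_t result = 0;

    UMUL16_STEP(a, b, 0);
    UMUL16_STEP(a, b, 1);
    UMUL16_STEP(a, b, 2);
    UMUL16_STEP(a, b, 3);
    UMUL16_STEP(a, b, 4);
    UMUL16_STEP(a, b, 5);
    UMUL16_STEP(a, b, 6);
    UMUL16_STEP(a, b, 7);
    UMUL16_STEP(a, b, 8);
    UMUL16_STEP(a, b, 9);
    UMUL16_STEP(a, b, 10);
    UMUL16_STEP(a, b, 11);
    UMUL16_STEP(a, b, 12);
    UMUL16_STEP(a, b, 13);
    UMUL16_STEP(a, b, 14);
    UMUL16_STEP(a, b, 15);

    return result;
}

/* Hypothetical helper: checks a sample of operand pairs against the
 * truncated 16-bit product computed with the built-in operator. */
static void umul16_selfcheck(void)
{
    for (uint32_t a = 0; a <= 0xFFFF; a += 251) {
        for (uint32_t b = 0; b <= 0xFFFF; b += 257) {
            assert(umul16((uint16_t)a, (uint16_t)b) == (uint16_t)(a * b));
        }
    }
}
```

Calling `umul16_selfcheck()` from a host-side `main` aborts on the first mismatch. It says nothing about the code the AVR compiler generates, only that the C logic is right.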
    

    Update:

    Depending on what your compiler does, the UMUL16_STEP macro can change. An alternative might be:

    #define UMUL16_STEP(a, b, shift) \
        if ((b) & (1U << (shift))) result += (a); (a) <<= 1;
    

    With this approach the compiler might be able to use the sbrc instruction to avoid branches.
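
To make the shifting variant concrete, here is a minimal sketch of a complete function built on it. The name `umul16_shift` is made up for illustration, and wrapping the macro body in `do { ... } while (0)` is a stylistic assumption so that the two statements always stay together. Whether the compiler actually emits `sbrc` for this is not something the sketch can show.

```c
#include <stdint.h>

/* Add a when bit `shift` of b is set, then double a for the next bit.
 * `result` and `a` are expected to be local variables at the call site. */
#define UMUL16_STEP(a, b, shift)                    \
    do {                                            \
        if ((b) & (1U << (shift))) result += (a);   \
        (a) <<= 1;                                  \
    } while (0)

/* Hypothetical name: unrolled shift-and-add, low 16 bits of a * b. */
static uint16_t umul16_shift(uint16_t a, uint16_t b)
{
    uint16_t result = 0;

    UMUL16_STEP(a, b, 0);
    UMUL16_STEP(a, b, 1);
    UMUL16_STEP(a, b, 2);
    UMUL16_STEP(a, b, 3);
    UMUL16_STEP(a, b, 4);
    UMUL16_STEP(a, b, 5);
    UMUL16_STEP(a, b, 6);
    UMUL16_STEP(a, b, 7);
    UMUL16_STEP(a, b, 8);
    UMUL16_STEP(a, b, 9);
    UMUL16_STEP(a, b, 10);
    UMUL16_STEP(a, b, 11);
    UMUL16_STEP(a, b, 12);
    UMUL16_STEP(a, b, 13);
    UMUL16_STEP(a, b, 14);
    UMUL16_STEP(a, b, 15);

    return result;
}
```

Because `a` is shifted in place, every step uses a fixed one-bit shift rather than a shift by a varying amount, which is the part that matters on a core without a barrel shifter.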

    My guess at what the assembly should look like for each bit, where r0:r1 holds the result, r2:r3 holds a, and r4:r5 holds b:

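    sbrc r4, 0      ; skip the add if bit 0 of b is clear
    add r0, r2      ; result low byte += a low byte
    sbrc r4, 0      ; same test again (sbrc leaves the carry flag alone)
    adc r1, r3      ; result high byte += a high byte + carry
    lsl r2          ; a <<= 1, low byte
    rol r3          ; a <<= 1, high byte takes the carry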
    

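    This should execute in constant time without a branch. When the tested bit is set, each of the six instructions takes one cycle; when it is clear, each sbrc takes two cycles to skip the instruction after it, so a step costs six cycles either way. Test the eight bits in r4 first, then the eight bits in r5 for the upper half of b. Sixteen steps at six cycles each gives 96 cycles for the multiplication, based on my reading of the instruction set manual.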
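
As a rough cross-check of the register-level idea, the per-bit step can be modelled in C with 8-bit variables named after the registers. This is only a sketch under assumptions: `umul16_bytes` is a hypothetical name, the carry is tracked by hand, and a loop stands in for the sixteen unrolled copies of the sequence.

```c
#include <stdint.h>

/* Byte-wise model of the sbrc/add/sbrc/adc/lsl/rol step.
 * r0:r1 = result (low:high), r2:r3 = a, r4:r5 = b, as in the listing. */
static uint16_t umul16_bytes(uint16_t a, uint16_t b)
{
    uint8_t r0 = 0, r1 = 0;
    uint8_t r2 = (uint8_t)(a & 0xFF), r3 = (uint8_t)(a >> 8);
    uint8_t r4 = (uint8_t)(b & 0xFF), r5 = (uint8_t)(b >> 8);

    for (uint8_t bit = 0; bit < 16; bit++) {
        /* Bits 0..7 come from r4, bits 8..15 from r5. */
        uint8_t src = (bit < 8) ? r4 : r5;

        if (src & (1u << (bit & 7))) {
            uint16_t sum = (uint16_t)(r0 + r2);      /* add r0, r2 */
            uint8_t carry = (uint8_t)(sum >> 8);
            r0 = (uint8_t)sum;
            r1 = (uint8_t)(r1 + r3 + carry);         /* adc r1, r3 */
        }

        uint8_t msb = (uint8_t)(r2 >> 7);            /* lsl r2 sets carry */
        r2 = (uint8_t)(r2 << 1);
        r3 = (uint8_t)((r3 << 1) | msb);             /* rol r3 */
    }

    return (uint16_t)(((uint16_t)r1 << 8) | r0);
}
```

The model only confirms the data flow (carry from the low byte into the high byte, and a being doubled every step). Cycle counts still have to be checked against the instruction set manual or measured on the device.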
