I'm searching for an algorithm to multiply two integers that is better than the one below. Do you have a good idea about that? (The MCU - ATtiny84/85 or similar - wher
One approach is to unroll the loop. I don't have a compiler for the platform you're using so I can't look at the generated code, but an approach like this could help.
The performance of this code is less data-dependent -- you go faster in the worst case by not checking to see if you're in the best case. Code size is a bit bigger but not the size of a lookup table.
(Note: this code is untested, off the top of my head. I'm curious about what the generated code looks like!)
#include <stdint.h>

/* If bit `shift` of b is set, add a shifted left by that many places. */
#define UMUL16_STEP(a, b, shift) \
    do { if ((b) & (1U << (shift))) result += (uint16_t)((a) << (shift)); } while (0)

uint16_t umul16(uint16_t a, uint16_t b)
{
    uint16_t result = 0;
    UMUL16_STEP(a, b, 0);
    UMUL16_STEP(a, b, 1);
    UMUL16_STEP(a, b, 2);
    UMUL16_STEP(a, b, 3);
    UMUL16_STEP(a, b, 4);
    UMUL16_STEP(a, b, 5);
    UMUL16_STEP(a, b, 6);
    UMUL16_STEP(a, b, 7);
    UMUL16_STEP(a, b, 8);
    UMUL16_STEP(a, b, 9);
    UMUL16_STEP(a, b, 10);
    UMUL16_STEP(a, b, 11);
    UMUL16_STEP(a, b, 12);
    UMUL16_STEP(a, b, 13);
    UMUL16_STEP(a, b, 14);
    UMUL16_STEP(a, b, 15);
    return result;
}
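Since the code above is untested, a quick check on a desktop machine may be worth doing before trying it on the MCU. The following is only a sketch: `umul16_loop` and `umul16_mismatches` are hypothetical names that are not part of the answer's code, and the loop is a rolled-up stand-in for the same shift-and-add steps (always 16 iterations, no early exit). It compares the result against the compiler's own `*`, truncated to 16 bits, over a sample grid of inputs.

```c
#include <stdint.h>

/* Hypothetical rolled-up form of the unrolled steps: always runs
   16 iterations, with no early exit when b has no set bits left. */
uint16_t umul16_loop(uint16_t a, uint16_t b)
{
    uint16_t result = 0;
    for (uint8_t shift = 0; shift < 16; shift++) {
        if (b & (1U << shift))
            result += (uint16_t)(a << shift);
    }
    return result;
}

/* Counts disagreements with the built-in multiply (low 16 bits only)
   over a sparse grid of inputs; the step sizes are arbitrary. */
uint32_t umul16_mismatches(void)
{
    uint32_t bad = 0;
    for (uint32_t a = 0; a <= 0xFFFF; a += 257) {
        for (uint32_t b = 0; b <= 0xFFFF; b += 251) {
            uint16_t expect = (uint16_t)(a * b);
            if (umul16_loop((uint16_t)a, (uint16_t)b) != expect)
                bad++;
        }
    }
    return bad;
}
```

To test the unrolled `umul16` itself, swap it in for `umul16_loop` inside the mismatch counter; a return value of 0 means every sampled pair agreed.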
Update:
Depending on what your compiler does, the UMUL16_STEP macro can change. An alternative might be:
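/* Add a if bit `shift` of b is set, then double a for the next step.
   This modifies a, which is fine for a by-value parameter. */
#define UMUL16_STEP(a, b, shift) \
    do { if ((b) & (1U << (shift))) result += (a); (a) <<= 1; } while (0)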
With this approach the compiler might be able to use the sbrc (skip if bit in register is cleared) instruction to avoid branches.
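For reference, here is roughly what the whole function could look like with the shift-as-you-go step. This is a sketch, not something I have compiled for AVR: `UMUL16_STEP2` and `umul16_shift_a` are names made up here to keep it separate from the first version, and whether the compiler actually emits sbrc for it is still a guess.

```c
#include <stdint.h>

/* One step: test a constant bit of b, conditionally add a, then double a.
   The macro name is hypothetical; it relies on a local named `result`. */
#define UMUL16_STEP2(a, b, bit) \
    do { if ((b) & (1U << (bit))) result += (a); (a) <<= 1; } while (0)

uint16_t umul16_shift_a(uint16_t a, uint16_t b)
{
    uint16_t result = 0;
    /* a is a by-value copy, so shifting it in place is harmless. */
    UMUL16_STEP2(a, b, 0);   UMUL16_STEP2(a, b, 1);
    UMUL16_STEP2(a, b, 2);   UMUL16_STEP2(a, b, 3);
    UMUL16_STEP2(a, b, 4);   UMUL16_STEP2(a, b, 5);
    UMUL16_STEP2(a, b, 6);   UMUL16_STEP2(a, b, 7);
    UMUL16_STEP2(a, b, 8);   UMUL16_STEP2(a, b, 9);
    UMUL16_STEP2(a, b, 10);  UMUL16_STEP2(a, b, 11);
    UMUL16_STEP2(a, b, 12);  UMUL16_STEP2(a, b, 13);
    UMUL16_STEP2(a, b, 14);  UMUL16_STEP2(a, b, 15);
    return result;
}
```

The point of this form is that each step only ever shifts by one and tests a fixed bit position, which maps more directly onto single AVR instructions than a shift by a variable-sized constant.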
My guess for how the assembler should look per bit, where r0:r1 is the result, r2:r3 is a, and r4:r5 is b:
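sbrc r4, 0     ; skip the add if bit 0 of b is clear
add  r0, r2    ; result low byte += a low byte
sbrc r4, 0     ; same test again for the high byte
adc  r1, r3    ; result high byte += a high byte + carry
lsl  r2        ; a <<= 1: shift the low byte, top bit goes to carry
rol  r3        ; rotate the carry into the high byte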
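This should execute in constant time without a branch. Test bits 0 through 7 of r4 this way, then test the bits in r5 for the upper eight bits. Each bit costs 6 cycles whether or not the add is skipped (a skipping sbrc takes 2 cycles, a non-skipping sbrc plus the add or adc also takes 2), so the multiplication should take 16 * 6 = 96 cycles based on my reading of the instruction set manual.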
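To sanity-check both the carry handling and the cycle estimate, the per-bit sequence can be modelled in C on a desktop machine. This is a rough model under stated assumptions, not a simulator: `umul16_model` and `umul16_model_t` are hypothetical names, and the timings assumed are sbrc at 1 cycle when it does not skip and 2 cycles when it skips a one-word instruction, with add, adc, lsl and rol at 1 cycle each. Register setup and call/return overhead are not counted.

```c
#include <stdint.h>

/* Hypothetical result type: the 16-bit product plus a modelled cycle count. */
typedef struct {
    uint16_t product;
    unsigned cycles;
} umul16_model_t;

umul16_model_t umul16_model(uint16_t a, uint16_t b)
{
    uint8_t r0 = 0, r1 = 0;                  /* result: low, high byte */
    uint8_t r2 = (uint8_t)a;                 /* a low byte  */
    uint8_t r3 = (uint8_t)(a >> 8);          /* a high byte */
    unsigned cycles = 0;

    for (uint8_t bit = 0; bit < 16; bit++) {
        if (b & (1U << bit)) {
            uint16_t lo = (uint16_t)(r0 + r2);        /* add r0, r2 */
            uint8_t carry = (uint8_t)(lo >> 8);
            r0 = (uint8_t)lo;
            r1 = (uint8_t)(r1 + r3 + carry);          /* adc r1, r3 */
            cycles += 4;  /* sbrc, add, sbrc, adc: 1 cycle each (assumed) */
        } else {
            cycles += 4;  /* two skipping sbrc: 2 cycles each (assumed) */
        }
        uint8_t msb = (uint8_t)(r2 >> 7);             /* lsl r2 */
        r2 = (uint8_t)(r2 << 1);
        r3 = (uint8_t)((r3 << 1) | msb);              /* rol r3 */
        cycles += 2;
    }

    umul16_model_t out;
    out.product = (uint16_t)(((uint16_t)r1 << 8) | r0);
    out.cycles = cycles;
    return out;
}
```

Under those assumptions both paths through a bit cost the same, so the modelled cycle count does not depend on the operands.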