Say I have implemented all of the ADD, AND, SHF, JUMP, BR, LDW, LDB (load word, load byte) and similar instructions, everything except MUL (multiply), in an assembly machine. Now I want to write a
The general idea is the same one you (should have) learned in school when you did "long multiplication", except that we do it in binary instead of decimal. Consider the two examples below:
      1010            1234
    x 1100          x 2121
    ------          ------
      0000            1234
     0000            2468
    1010            1234
 + 1010          + 2468
 ---------       ---------
   1111000         2617314
The example on the right is base-10 (decimal) and the example on the left is binary. Observe that in binary the only digits you ever multiply the top factor by are 0 and 1. Multiplying by zero is easy: the answer is always zero, so you don't even have to worry about adding it in. Multiplying by one is also easy: it is just a matter of knowing "how far over to shift it". But that is easy too: it is as far over as you had to look to check that bit.
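To make the "how far over to shift it" point concrete, here is a small Python sketch (not target assembly) that lists the shifted copies of the top factor selected by the 1 bits of the bottom factor. The function name `partial_products` and its parameter names are only illustrative.

```python
def partial_products(top, bottom):
    """Return the shifted copies of `top` picked out by the 1 bits of `bottom`."""
    terms = []
    position = 0
    # Walk the bits of `bottom` from least to most significant.
    while bottom >> position:
        if (bottom >> position) & 1:
            # A 1 bit at this position contributes `top` shifted left that far.
            terms.append(top << position)
        position += 1
    return terms


terms = partial_products(0b1010, 0b1100)
print([bin(t) for t in terms])  # ['0b101000', '0b1010000']
print(bin(sum(terms)))          # 0b1111000
```

The two zero rows from the binary example never appear in the list, which is why they can simply be skipped.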
Start with a 16-bit working copy of one number, and a 16-bit accumulator set to zero. Go through the other ("top") number one bit at a time by shifting it right; any time there is a one in the right-most digit, add the working copy to the accumulator. Whether that bit was a one or a zero, you then shift the working copy to the left by one bit. When the "top" number gets to zero you know you are done, and the answer is in the accumulator.
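The loop just described can be sketched in Python as follows. This is only a model of the register steps, assuming unsigned 8-bit inputs and a 16-bit result; the names `shift_add_multiply`, `multiplier` and `working` are made up for illustration, and the masks stand in for the fixed register widths of a real machine.

```python
MASK16 = 0xFFFF  # emulates a 16-bit register


def shift_add_multiply(multiplier, multiplicand):
    """Multiply two unsigned 8-bit values using only shift, AND and add."""
    multiplier &= 0xFF              # the number whose bits are examined
    working = multiplicand & 0xFF   # 16-bit working copy of the other number
    accumulator = 0                 # 16-bit accumulator, starts at zero

    while multiplier != 0:          # done once no 1 bits remain
        if multiplier & 1:          # right-most bit is one: add this partial product
            accumulator = (accumulator + working) & MASK16
        working = (working << 1) & MASK16  # always shift the working copy left
        multiplier >>= 1                   # move on to the next bit
    return accumulator


print(shift_add_multiply(0b1100, 0b1010))  # 120
```

On a real instruction set, each line of the loop body would correspond to roughly one instruction from the list in the question: an AND to test the low bit, a BR to skip the ADD, and two shifts.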
There are some optimizations you can use so that you don't need as many 16-bit registers (or 8-bit register pairs), but I'll leave you to work out the details.
Seems you are using an 8/16-bit processor similar to the 8080, 6502, 6800 and the like. Yep, an 8-iteration loop of shifts and adds is enough and almost optimal. OTOH, if you have about 1 KB to spare for a constant table (511 two-byte entries), the approach using the following formula could be the fastest one:
a*b = square(a+b)/4 - square(a-b)/4
If the arguments are unsigned 8-bit values, the maximum of a+b is 510. You need to keep only the integer part of x**2/4 for any x: a+b and a-b always have the same parity, so the fractional parts (0 or 1/4) of the two terms cancel each other. So, the mapping is: 0 -> 0, 1 -> 0, 2 -> 1, 3 -> 2, 4 -> 4, ..., 510 -> 65025. For signed arguments, the table is about half the size.
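A minimal Python sketch of the table approach, assuming unsigned 8-bit arguments. The names `QUARTER_SQUARES` and `table_multiply` are illustrative only. The table is built here with a multiplication for brevity; on the target machine it would be precomputed constant data. Taking the absolute value of a-b is an assumption about how to index the table when a < b, which works because the square is the same for either sign.

```python
# floor(x*x / 4) for x = 0..510; on real hardware this is a constant table in memory.
QUARTER_SQUARES = [(x * x) // 4 for x in range(511)]


def table_multiply(a, b):
    """Multiply two unsigned 8-bit values with one add, two subtracts, two lookups."""
    total = a + b              # at most 510
    difference = abs(a - b)    # at most 255; the sign does not affect the square
    return QUARTER_SQUARES[total] - QUARTER_SQUARES[difference]


print(QUARTER_SQUARES[:5])      # [0, 0, 1, 2, 4]
print(table_multiply(10, 12))   # 120
```

Every entry fits in 16 bits, since the largest one is 65025.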
There are many other approaches to fast multiplication, including ones with almost linear cost; see e.g. volume 2 of Donald Knuth's legendary book series "The Art of Computer Programming". But they all have far too much overhead for 8-bit arguments.