Is integer multiplication really done at the same speed as addition on a modern CPU?

梦谈多话 2020-12-23 13:31

I hear this statement quite often, that multiplication on modern hardware is so optimized that it actually is at the same speed as addition. Is that true?

I never ca

11 Answers
  • 2020-12-23 14:11

    Since the other answers deal with real, present-day devices -- which are bound to change and improve as time passes -- I thought we could look at the question from the theoretical side.

    Proposition: When implemented in logic gates, using the usual algorithms, an integer multiplication circuit is O(log N) times slower than an addition circuit, where N is the number of bits in a word.

    Proof: The time for a combinational circuit to stabilise is proportional to the depth of the longest path of logic gates from any input to any output. So we must show that a grade-school multiply circuit is O(log N) times deeper than an addition circuit.

    Addition is normally implemented as a half adder followed by N-1 full adders, with the carry bits chained from one adder to the next (a ripple-carry adder). This circuit clearly has depth O(N). (It can be optimized in many ways; carry-lookahead designs reach O(log N) depth, but the "usual algorithm" assumed by the proposition is the ripple-carry chain, and in any case constant depth would require absurdly large lookup tables.)

    To multiply A by B, we first need to multiply each bit of A with each bit of B. Each bitwise multiply is simply an AND gate. There are N^2 bitwise multiplications to perform, hence N^2 AND gates, but all of them can operate in parallel, for a circuit depth of 1. That takes care of the multiplication phase of the grade-school algorithm, leaving just the addition phase.

    In the addition phase, we can combine the partial products using an inverted binary-tree-shaped circuit to do many of the additions in parallel. The tree is O(log N) levels deep, and at each level we add together two numbers of O(N) bits. This means each level can be implemented with an adder of depth O(N), giving a total circuit depth of O(N log N). QED.
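
    To make the two phases concrete, here is a minimal C sketch (my own illustration, not part of the proof) that mirrors the circuit: the AND phase forms the partial-product rows, and the summation loop stands in for the adder tree.

    #include <stdint.h>

    /* Grade-school multiply written to mirror the circuit above.
     * Phase 1: each row is A AND-ed with one bit of B (depth 1 in
     * hardware, since all N^2 AND gates work in parallel).
     * Phase 2: sum the shifted rows (hardware pairs them up in an
     * O(log N)-level tree; this loop does it serially). */
    uint64_t gradeschool_mul(uint32_t a, uint32_t b) {
        uint64_t rows[32];
        for (int i = 0; i < 32; i++) {
            uint64_t bit = (b >> i) & 1u;            /* bit i of B */
            rows[i] = bit ? ((uint64_t)a << i) : 0;  /* AND row, shifted into place */
        }
        uint64_t sum = 0;
        for (int i = 0; i < 32; i++)
            sum += rows[i];
        return sum;
    }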

  • 2020-12-23 14:12

    Intel since Haswell has

    • add performance of 4/clock throughput, 1 cycle latency. (Any operand-size)
    • imul performance of 1/clock throughput, 3 cycle latency. (Any operand-size)

    Ryzen is similar. Bulldozer-family has much lower integer throughput and not-fully-pipelined multiply, including extra slow for 64-bit operand-size multiply. See https://agner.org/optimize/ and other links in https://stackoverflow.com/tags/x86/info
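
    To see why both numbers matter, here is a hedged microbenchmark sketch (function names and constants are mine, not from Agner's tables): a chain of dependent multiplies is limited by latency, while independent adds are limited by throughput.

    #include <stdint.h>

    /* Each multiply depends on the previous result, so the loop can't run
     * faster than about 1 iteration per 3 cycles on Haswell (imul latency).
     * The large odd constant discourages the compiler from rewriting the
     * multiply as lea/shift sequences. */
    uint64_t mul_latency_chain(uint64_t x, long iters) {
        for (long i = 0; i < iters; i++)
            x = x * 0x9E3779B97F4A7C15ull;   /* serial dependency through x */
        return x;
    }

    /* Four independent accumulators can all be in flight at once, so the
     * adds can approach 4 per clock (throughput), assuming the compiler
     * doesn't auto-vectorize this loop. */
    uint64_t add_throughput(uint64_t n) {
        uint64_t a = 0, b = 0, c = 0, d = 0;
        for (uint64_t i = 0; i < n; i += 4) {
            a += i; b += i + 1; c += i + 2; d += i + 3;
        }
        return a + b + c + d;
    }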

    But a good compiler could auto-vectorize your loops (SIMD-integer multiply throughput and latency are both worse than SIMD-integer add), or simply constant-propagate through them and just print out the answer! Clang really does know the closed-form Gauss formula for sum(i=0..n) and can recognize some loops that do that.
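
    For example (a sketch, assuming clang at -O2 or similar; the exact output varies by version):

    #include <stdio.h>

    /* clang's scalar-evolution pass recognizes this induction pattern and
     * replaces the loop with a closed-form expression, so no adds execute
     * at run time. */
    unsigned sum_0_to_n(unsigned n) {
        unsigned sum = 0;
        for (unsigned i = 0; i <= n; i++)
            sum += i;
        return sum;          /* compiles to the equivalent of n*(n+1)/2 */
    }

    int main(void) {
        printf("%u\n", sum_0_to_n(100));   /* constant-folded to 5050 */
        return 0;
    }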


    You forgot to enable optimization, so both loops bottleneck on the ALU + store/reload latency of keeping sum in memory between each sum += independent stuff and the following sum++. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about just how bad the resulting asm is, and why that's the case. clang++ defaults to -O0 (debug mode: keep variables in memory where a debugger can modify them between any C++ statements).

    Store-forwarding latency on a modern x86 like Sandybridge-family (including Haswell and Skylake) is about 3 to 5 cycles, depending on timing of the reload. So with a 1-cycle latency ALU add in there, too, you're looking at about two 6-cycle latency steps in the critical path for this loop. (Plenty to hide all the store / reload and calculation based on i, and the loop-counter update).

    See also Adding a redundant assignment speeds up code when compiled without optimization for another no-optimization benchmark. In that one, store-forwarding latency is actually reduced by having more independent work in the loop, delaying the reload attempt.


    Modern x86 CPUs have 1/clock multiply throughput, so even with optimization you wouldn't see a throughput bottleneck from it. (On Bulldozer-family it's 1 per 2 clocks, because the multiplier there is not fully pipelined.)

    More likely you'd bottleneck on the front-end work of getting all the work issued every cycle.

    lea does allow very efficient copy-and-add, computing i + i + 1 in a single instruction. But really, a good compiler would see that the loop only uses 2*i and strength-reduce it to increment by 2, i.e. repeated addition of 2 instead of having to shift inside the loop.

    And of course with optimization the extra sum++ can simply fold into sum += stuff, where stuff already includes a constant. Not so with the multiply.
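
    A sketch of what those two transformations amount to at the source level (hypothetical before/after; the compiler does this internally on its IR):

    #include <stdint.h>

    /* As written: a multiply (or lea/shift) plus an extra increment
     * every iteration. */
    uint64_t as_written(uint64_t n) {
        uint64_t sum = 0;
        for (uint64_t i = 0; i < n; i++) {
            sum += 2 * i;   /* uses only 2*i, never i itself */
            sum++;          /* folds into the constant below */
        }
        return sum;
    }

    /* Roughly what the optimizer produces: 2*i becomes a running value
     * incremented by 2, and the ++ folds into the +1. */
    uint64_t strength_reduced(uint64_t n) {
        uint64_t sum = 0, two_i = 0;
        for (uint64_t i = 0; i < n; i++) {
            sum += two_i + 1;
            two_i += 2;     /* repeated addition instead of a multiply */
        }
        return sum;
    }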

  • 2020-12-23 14:16

    I came to this thread to get an idea of what modern processors are doing in regard to integer math and the number of cycles required to do it. I worked on the problem of speeding up 32-bit integer multiplies and divides on the 65c816 processor in the 1990s. Using the method below, I was able to triple the speed of the standard math libraries available in the ORCA/M compilers at the time.

    So the idea that multiplies are as fast as adds is simply not the case (except rarely); as others have said, it depends on how the architecture is implemented. If enough steps can be performed between clock edges, then yes, a multiply could effectively take the same number of clocks as an add, but there would be a lot of wasted time. In that case it would be nice to have an instruction that performs multiple (dependent) adds/subtracts given one instruction and multiple values. One can dream.

    On the 65c816 processor, there were no multiply or divide instructions; multiplies and divides were done with shifts and adds.
    To perform a 16-bit add, you would do the following:

    LDA $0000 - load a value into the accumulator (5 cycles)
    ADC $0002 - add with carry (5 cycles)
    STA $0004 - store the accumulator back to memory (5 cycles)
    15 cycles total for an add
    

    If dealing with a call from C, you would have the additional overhead of pushing and pulling values off the stack. Creating routines that would do two multiplies at once would save overhead, for example.

    The traditional way of doing the multiply is shift-and-add through the entire value of one operand: each time a 1 bit shifts out into the carry, you add the other value again. This requires a test of each bit and a shift of the result.
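
    In C, that classic loop looks something like this (a generic sketch, not Ken's ORCA/M library code):

    #include <stdint.h>

    /* Classic shift-and-add multiply, the algorithm described above.
     * Examines each bit of b; where it is set, adds the shifted a. */
    uint32_t shift_add_mul(uint32_t a, uint32_t b) {
        uint32_t result = 0;
        while (b != 0) {
            if (b & 1)          /* test the low bit (the "carry" test) */
                result += a;    /* add the value again */
            a <<= 1;            /* shift the addend left */
            b >>= 1;            /* move to the next bit */
        }
        return result;
    }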

    I replaced that with a lookup table of 256 entries so the carry bits would not need to be checked. It was also possible to determine overflow before doing the multiply, so as not to waste time. (On a modern processor this could be done in parallel, but I don't know if they do this in hardware.) Given two 32-bit numbers and pre-screened overflow, one of the multipliers is always 16 bits or less, so one would only need to run through the 8-bit multiply once or twice to perform the entire 32-bit multiply. The result was multiplies that were 3 times as fast.

    The speed of the 16-bit multiplies ranged from 12 cycles to about 37 cycles:

    multiply by 2 (0000 0010)
    LDA $0000 - load the value into the accumulator (5 cycles)
    ASL  - shift left (2 cycles)
    STA $0004 - store the accumulator back to memory (5 cycles)
    12 cycles plus call overhead
    
    multiply by 90 (0101 1010)
    LDA $0000 - load the value into the accumulator (5 cycles)
    ASL  - shift left (2 cycles)
    ASL  - shift left (2 cycles)
    ADC $0000 - add with carry for this bit (5 cycles)
    ASL  - shift left (2 cycles)
    ADC $0000 - add with carry for this bit (5 cycles)
    ASL  - shift left (2 cycles)
    ASL  - shift left (2 cycles)
    ADC $0000 - add with carry for this bit (5 cycles)
    ASL  - shift left (2 cycles)
    STA $0004 - store the accumulator back to memory (5 cycles)
    37 cycles plus call overhead
    

    Since the data bus of the Apple IIgs, for which this was written, was only 8 bits wide, loading a 16-bit value from memory required 5 cycles: extra cycles were needed for the pointer and for the second byte.

    The LDA instruction itself takes 1 cycle (it is an 8-bit opcode); the $0000 operand is a 16-bit address that takes two cycles to fetch; and the 16-bit value at that memory location takes two cycles to load because of the 8-bit data bus.

    A modern processor would be able to do this faster because it has at least a 32-bit data bus. Within the processor logic itself, the system of gates would add no extra delay on top of the data-bus delay, since the whole value gets loaded at once.

    To do the complete 32-bit multiply, you would need to do the above twice and add the results together to get the final answer. A modern processor should be able to do the two in parallel and add the results, and combined with the overflow pre-check done in parallel, that would minimize the time required to do the multiply.
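
    In C, that decomposition looks roughly like this (a hedged sketch with my own names; the 16-bit multiplies here stand in for the table-driven 8-bit routines):

    #include <stdint.h>

    /* 32x32 -> 32 multiply built from two 16-bit-wide partial products,
     * as described above: multiply by the low half, multiply by the high
     * half shifted up 16, then add the results. */
    uint32_t mul32_via_16(uint32_t a, uint32_t b) {
        uint32_t b_lo = b & 0xFFFFu;
        uint32_t b_hi = b >> 16;
        uint32_t lo_product = a * b_lo;          /* first partial product  */
        uint32_t hi_product = (a * b_hi) << 16;  /* second, shifted into place */
        return lo_product + hi_product;          /* add the partial results */
    }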

    Anyway, it is readily apparent that multiplies require significantly more effort than adds. How many steps can be processed between CPU clock edges determines how many clock cycles are required. If the clock is slow enough, the adds will appear to run at the same speed as a multiply.

    Regards, Ken

  • 2020-12-23 14:20

    No, they're not the same speed. Who told you that?

    Agner Fog's instruction tables show that when using 32-bit integer registers, Haswell's ADD/SUB take 0.25–1 cycles (the low end is reciprocal throughput with well-pipelined independent instructions, the high end is latency) while MUL takes 2–4 cycles. Floating-point is the other way around: ADDSS/SUBSS take 1–3 cycles while MULSS takes 0.5–5 cycles.

  • 2020-12-23 14:23

    Even on ARM (known for its high efficiency and small, clean design), integer multiplication takes 3-7 cycles, whereas integer addition takes 1 cycle.

    However, an add/shift trick is often used to multiply integers by constants faster than the multiply instruction can calculate the answer.

    The reason this works well on ARM is that ARM has a "barrel shifter", which allows many instructions to shift or rotate one of their arguments by 1-31 bits at zero cost, i.e. x = a + b and x = a + (b << s) take exactly the same amount of time.

    Utilizing this processor feature, let's say you want to calculate a * 15. Then since 15 = 1111 (base 2), the following pseudocode (translated into ARM assembly) would implement the multiplication:

    a_times_3 = a + (a << 1)                  // a * (0011 (base 2))
    a_times_15 = a_times_3 + (a_times_3 << 2) // a * (0011 (base 2) + 1100 (base 2))
    

    Similarly you could multiply by 13 = 1101 (base 2) using either of the following:

    a_times_5 = a + (a << 2)
    a_times_13 = a_times_5 + (a << 3)
    
    a_times_3 = a + (a << 1)
    a_times_15 = a_times_3 + (a_times_3 << 2)
    a_times_13 = a_times_15 - (a << 1)
    

    The first snippet is obviously faster in this case, but sometimes subtraction helps when translating a constant multiplication into add/shift combinations.

    This multiplication trick was used heavily in the ARM assembly coding community in the late 80s, on the Acorn Archimedes and Acorn RISC PC (the origin of the ARM processor). Back then, a lot of ARM assembly was written by hand, since squeezing every last cycle out of the processor was important. Coders in the ARM demoscene developed many techniques like this for speeding up code, most of which are probably lost to history now that almost no assembly code is written by hand anymore. Compilers probably incorporate many tricks like this, but I'm sure there are many more that never made the transition from "black art optimization" to compiler implementation.

    You can of course write explicit add/shift multiplication code like this in any compiled language, and the code may or may not run faster than a straight multiplication once compiled.
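
    For example, here is a C rendering of the x13 decomposition above; whether it actually beats a plain multiply depends entirely on the target CPU and compiler:

    #include <stdint.h>

    /* Multiply by 13 (1101 base 2) with shifts and adds, as above. */
    uint32_t times13_shift_add(uint32_t a) {
        uint32_t a_times_5 = a + (a << 2);   /* a * 0101 (base 2) */
        return a_times_5 + (a << 3);         /* + a * 1000 = a * 1101 */
    }

    /* The compiler is free to do this itself; on x86_64 it may instead
     * emit lea chains or a single imul for a plain multiply. */
    uint32_t times13_plain(uint32_t a) {
        return a * 13;
    }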

    x86_64 may also benefit from this multiplication trick for small constants, although I don't believe shifting is zero-cost on x86_64, in either the Intel or AMD implementations; a separate integer shift or rotate probably costs one extra cycle.
