Is integer multiplication really done at the same speed as addition on a modern CPU?

问题

I hear this statement quite often, that multiplication on modern hardware is so optimized that it actually is at the same speed as addition. Is that true?

I never can get any authoritative confirmation. My own research only adds questions. The speed tests usually show data that confuses me. Here is an example:

#include <stdio.h>
#include <sys/time.h>

unsigned int time1000() {
    timeval val;
    gettimeofday(&val, 0);
    val.tv_sec &= 0xffff;
    return val.tv_sec * 1000 + val.tv_usec / 1000;
}

int main() {
    unsigned int sum = 1, T = time1000();
    for (int i = 1; i < 100000000; i++) {
        sum += i + (i+1); sum++;
    }
    printf("%u %u\n", time1000() - T, sum);
    sum = 1;
    T = time1000();
    for (int i = 1; i < 100000000; i++) {
        sum += i * (i+1); sum++;
    }
    printf("%u %u\n", time1000() - T, sum);
}

The code above can show that multiplication is faster:

clang++ benchmark.cpp -o benchmark
./benchmark
746 1974919423
708 3830355456

But with other compilers, other compiler arguments, differently written inner loops, the results can vary and I cannot even get an approximation.

回答1:

Multiplication of two n-bit numbers can in fact be done in O(log n) circuit depth, just like addition.

Addition in O(log n) is done by splitting the number in half and (recursively) adding the two parts in parallel, where the upper half is solved for both the "0-carry" and "1-carry" case. Once the lower half is added, the carry is examined, and its value is used to choose between the 0-carry and 1-carry case.

Multiplication in O(log n) depth is also done through parallelization, where every sum of 3 numbers is reduced to a sum of just 2 numbers in parallel, and the sums are done in some manner like the above.
I won't explain it here, but you can find reading material on fast addition and multiplication by looking up "carry-lookahead" and "carry-save" addition.

So from a theoretical standpoint, since circuits are obviously inherently parallel (unlike software), the only reason multiplication would be asymptotically slower is the constant factor in the front, not the asymptotic complexity.

回答2:

No, they're not the same speed. Who told you that?

Agner Fog's instruction tables show that when using 32-bit integer registers, Haswell's ADD/SUB take 0.25–1 cycles (depending on how well pipelined your instructions are) while MUL takes 2–4 cycles. Floating-point is the other way around: ADDSS/SUBSS take 1–3 cycles while MULSS takes 0.5–5 cycles.

回答3:

This is an even more complex answer than simply multiplication versus addition. In reality the answer will most likely NEVER be yes. Multiplication, electronically, is a much more complicated circuit. Most of the reasons why, is that multiplication is the act of a multiplication step followed by an addition step, remember what it was like to multiply decimal numbers prior to using a calculator.

The other thing to remember is that multiplication will take longer or shorter depending on the architecture of the processor you are running it on. This may or may not be simply company specific. While an AMD will most likely be different than an Intel, even an Intel i7 may be different from a core 2 (within the same generation), and certainly different between generations (especially the farther back you go).

In all TECHNICALITY, if multiplies were the only thing you were doing (without looping, counting etc...), multiplies would be 2 to (as ive seen on PPC architectures) 35 times slower. This is more an exercise in understanding your architecture, and electronics.

In Addition: It should be noted that a processor COULD be built for which ALL operations including a multiply take a single clock. What this processor would have to do is, get rid of all pipelining, and slow the clock so that the HW latency of any OPs circuit is less than or equal to the latency PROVIDED by the clock timing.

To do this would get rid of the inherent performance gains we are able to get when adding pipelining into a processor. Pipelining is the idea of taking a task and breaking it down into smaller sub-tasks that can be performed much quicker. By storing and forwarding the results of each sub-task between sub-tasks, we can now run a faster clock rate that only needs to allow for the longest latency of the sub-tasks, and not from the overarching task as a whole.

Picture of time through a multiply:

|--------------------------------------------------| Non-Pipelined

In the above diagram, the non-pipelined circuit takes 50 units of time. In the pipelined version, we have split the 50 units into 5 steps each taking 10 units of time, with a store step in between. It is EXTREMELY important to note that in the pipelined example, each of the steps can be working completely on their own and in parallel. For an operation to be completed, it must move through all 5 steps in order but another of the same operation with operands can be in step 2 as one is in step 1, 3, 4, and 5.

With all of this being said, this pipelined approach allows us to continuously fill the operator each clock cycle, and get a result out on each clock cycle IF we are able to order our operations such that we can perform all of one operation before we switch to another operation, and all we take as a timing hit is the original amount of clocks necessary to get the FIRST operation out of the pipeline.

Mystical brings up another good point. It is also important to look at the architecture from a more systems perspective. It is true that the newer Haswell architectures was built to better the Floating Point multiply performance within the processor. For this reason as the System level, it was architected to allow multiple multiplies to occur in simultaneity versus an add which can only happen once per system clock.

All of this can be summed up as follows:

Each architecture is different from a lower level HW perspective as well as from a system perspective
FUNCTIONALLY, a multiply will always take more time than an add because it combines a true multiply along with a true addition step.
Understand the architecture you are trying to run your code on, and find the right balance between readability and getting truly the best performance from that architecture.

回答4:

Intel since Haswell has

add performance of 4/clock throughput, 1 cycle latency. (Any operand-size)
imul performance of 1/clock throughput, 3 cycle latency. (Any operand-size)

Ryzen is similar. Bulldozer-family has much lower integer throughput and not-fully-pipelined multiply, including extra slow for 64-bit operand-size multiply. See https://agner.org/optimize/ and other links in https://stackoverflow.com/tags/x86/info

But a good compiler could auto-vectorize your loops. (SIMD-integer multiply throughput and latency are both worse than SIMD-integer add). Or simply constant-propagate through them to just print out the answer! Clang really does know the closed-form Gauss's formula for sum(i=0..n) and can recognize some loops that do that.

You forgot to enable optimization so both loops bottleneck on the ALU + store/reload latency of keeping sum in memory between each of sum += independent stuff and sum++. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about just how bad the resulting asm is, and why that's the case. clang++ defaults to -O0 (debug mode: keep variables in memory where a debugger can modify them between any C++ statements).

Store-forwarding latency on a modern x86 like Sandybridge-family (including Haswell and Skylake) is about 3 to 5 cycles, depending on timing of the reload. So with a 1-cycle latency ALU add in there, too, you're looking at about two 6-cycle latency steps in the critical path for this loop. (Plenty to hide all the store / reload and calculation based on i, and the loop-counter update).

See also Adding a redundant assignment speeds up code when compiled without optimization for another no-optimization benchmark. In that one, store-forwarding latency is actually reduced by having more independent work in the loop, delaying the reload attempt.

Modern x86 CPUs have 1/clock multiply throughput so even with optimization you wouldn't see a throughput bottleneck from it. Or on Bulldozer-family, not fully pipelined with 1 per 2-clock throughput.

More likely you'd bottleneck on the front-end work of getting all the work issued every cycle.

Although lea does allow very efficient copy-and-add, and doing i + i + 1 with a single instruction. Although really a good compiler would see that the loop only uses 2*i and optimize to increment by 2. i.e. a strength-reduction to do repeated addition by 2 instead of having to shift inside the loop.

And of course with optimization the extra sum++ can just fold into the sum += stuff where stuff already includes a constant. Not so with the multiply.

回答5:

A multiplication requires a final step of an addition of, at minimum, the same size of the number; so it will take longer than an addition. In decimal:

    123
    112
   ----
   +246  ----
   123      | matrix generation  
  123    ----
  -----
  13776 <---------------- Addition

Same applies in binary, with a more elaborate reduction of the matrix.

That said, reasons why they may take the same amount of time:

To simplify the pipelined architecture, all regular instructions can be designed to take the same amount of cycles (exceptions are memory moves for instance, that depend on how long it takes to talk to external memory).
Since the adder for the final step of the multiplier is just like the adder for an add instruction... why not use the same adder by skipping the matrix generation and reduction? If they use the same adder, then obviously they will take the same amount of time.

Of course, there are more complex architectures where this is not the case, and you might obtain completely different values. You also have architectures that take several instructions in parallel when they don't depend on each other, and then you are a bit at the mercy of your compiler... and of the operating system.

The only way to run this test rigorously you would have to run in assembly and without an operating system - otherwise there are too many variables.

回答6:

Even if it were, that mostly tells us what restriction the clock puts on our hardware. We can't clock higher because heat(?), but the number of ADD instruction gates a signal could pass during a clock could be very many but a single ADD instruction would only utilize one of them. So while it may at some point take equally many clock cycles, not all of the propagation time for the signals is utilized.

If we could clock higher we could def. make ADD faster probably by several orders of magnitude.

回答7:

I came to this thread to get an idea of what the modern processors are doing in regard to integer math and the number of cycles required to do them. I worked on this problem of speeding up 32-bit integer multiplies and divides on the 65c816 processor in the 1990's. Using the method below, I was able to triple the speed of the standard math libraries available in the ORCA/M compilers at the time.

So the idea that multiplies are faster than adds is simply not the case (except rarely) but like people said it depends upon how the architecture is implemented. If there are enough steps being performed available between clock cycles, yes a multiply could effectively be the same speed as an add based on the clock, but there would be a lot of wasted time. In that case it would be nice to have an instruction that performs multiple (dependent) adds / subtracts given one instruction and multiple values. One can dream.

On the 65c816 processor, there were no multiply or divide instructions. Mult and Div were done with shifts and adds.
To perform a 16 bit add, you would do the following:

LDA $0000 - loaded a value into the Accumulator (5 cycles)
ADC $0002 - add with carry  (5 cycles)
STA $0004 - store the value in the Accumulator back to memory (5 cycles)
15 cycles total for an add

If dealing with a call like from C, you would have additional overhead of dealing with pushing and pulling values off the stack. Creating routines that would do two multiples at once would save overhead for example.

The traditional way of doing the multiply is shifts and adds through the entire value of the one number. Each time the carry became a one as it is shifted left would mean you needed to add the value again. This required a test of each bit and a shift of the result.

I replaced that with a lookup table of 256 items so as the carry bits would not need to be checked. It was also possible to determine overflow before doing the multiply to not waste time. (On a modern processor this could be done in parallel but I don't know if they do this in the hardware). Given two 32 bit numbers and prescreened overflow, one of the multipliers is always 16 bits or less, thus one would only need to run through 8 bit multiplies once or twice to perform the entire 32 bit multiply. The result of this was multiplies that were 3 times as fast.

the speed of the 16 bit multiplies ranged from 12 cycles to about 37 cycles

multiply by 2  (0000 0010)
LDA $0000 - loaded a value into the Accumulator  (5 cycles).
ASL  - shift left (2 cycles).
STA $0004 - store the value in the Accumulator back to memory (5 cycles).
12 cycles plus call overhead.

multiply by (0101 1010)
LDA $0000 - loaded a value into the Accumulator  (5 cycles) 
ASL  - shift left (2 cycles) 
ASL  - shift left (2 cycles) 
ADC $0000 - add with carry for next bit (5 cycles) 
ASL  - shift left (2 cycles) 
ADC $0000 - add with carry for next bit (5 cycles) 
ASL  - shift left (2 cycles) 
ASL  - shift left (2 cycles) 
ADC $0000 - add with carry for next bit (5 cycles) 
ASL  - shift left (2 cycles) 
STA $0004 - store the value in the Accumulator back to memory (5 cycles)
37 cycles plus call overhead

Since the databus of the AppleIIgs for which this was written was only 8 bits wide, to load 16 bit values required 5 cycles to load from memory, one extra for the pointer, and one extra cycle for the second byte.

LDA instruction (1 cycle because it is an 8 bit value) $0000 (16 bit value requires two cycles to load) memory location (requires two cycles to load because of an 8 bit data bus)

Modern processors would be able to do this faster because they have a 32 bit data bus at worst. In the processor logic itself the system of gates would have no additional delay at all compared to the data bus delay since the whole value would get loaded at once.

To do the complete 32 bit multiply, you would need to do the above twice and add the results together to get the final answer. The modern processors should be able to do the two in parallel and add the results for the answer. Combined with the overflow precheck done in parallel, it would minimize the time required to do the multiply.

Anyway it is readily apparent that multiplies require significantly more effort than an add. How many steps to process the operation between cpu clock cycles would determine how many cycles of the clock would be required. If the clock is slow enough, then the adds would appear to be the same speed as a multiply.

Regards, Ken

回答8:

This really depends on your machine. Of course, integer multiplication is quite complex compared to addition, but quite a few AMD CPU can execute a multiplication in a single cycle. That is just as fast as addition.

Other CPUs take three or four cycles to do a multiplication, which is a bit slower than addition. But it's nowhere near the performance penalty you had to suffer ten years ago (back then a 32-Bit multiplication could take thirty-something cycles on some CPUs).

So, yes, multiplication is in the same speed class nowadays, but no, it's still not exactly as fast as addition on all CPUs.

回答9:

No it's not, and in fact it's noticeably slower (which translated into a 15% performance hit for the particular real-world program I was running).

I realized this myself when asking this question from just a few days ago here.

回答10:

Since the other answers deal with real, present-day devices -- which are bound to change and improve as time passes -- I thought we could look at the question from the theoretical side.

Proposition: When implemented in logic gates, using the usual algorithms, an integer multiplication circuit is O(log N) times slower than an addition circuit, where N is the number of bits in a word.

Proof: The time for a combinatorial circuit to stabilise is proportional to the depth of the longest sequence of logic gates from any input to any output. So we must show that a gradeschool multiply circuit is O(log N) times deeper than an addition circuit.

Addition is normally implemented as a half adder followed by N-1 full adders, with the carry bits chained from one adder to the next. This circuit clearly has depth O(N). (This circuit can be optimized in many ways, but the worst case performance will always be O(N) unless absurdly large lookup tables are used.)

To multiply A by B, we first need to multiply each bit of A with each bit of B. Each bitwise multiply is simply an AND gate. There are N^2 bitwise multiplications to perform, hence N^2 AND gates -- but all of them can execute in parallel, for a circuit depth of 1. This solves the multiplication phase of the gradeschool algorithm, leaving just the addition phase.

In the addition phase, we can combine the partial products using an inverted binary tree-shaped circuit to do many of the additions in parallel. The tree will be (log N) nodes deep, and at each node, we will be adding together two numbers with O(N) bits. This means each node can be implemented with an adder of depth O(N), giving a total circuit depth of O(N log N). QED.

来源：https://stackoverflow.com/questions/21819682/is-integer-multiplication-really-done-at-the-same-speed-as-addition-on-a-modern

标签

c++

performance

cpu

multiplication

addition