As somebody new to assembly, I use gcc for reverse engineering. But now I ran in a somehow funny problem: I try to multiply two 64bit-integer for x86-64. The C - code looks
It looks like you can't use imul
without a bunch of extra code, since CF and OF are both set the same way. As the "operation" section of the manual describes, they're set if the full 128b result doesn't match sign_extend(low_half_result)
. So you're right, even the multi-operand forms of imul
still have some signed behaviour. It would be nice if they were like add
/sub
and set OF and CF independently, so you could look at CF for unsigned data or OF for signed data.
One of the best ways to find a good asm sequence for something is to ask a compiler. C doesn't have convenient integer-overflow detection, but Rust does.
I compiled this function to return the value and the unsigned-wraparound detection bool. Apparently Rust's ABI returns them passing pointer as a hidden first arg, instead of in rdx:rax like I think the C ABI would for such a small struct. :(
pub fn overflowing_mul(a: u64, b: u64) -> (u64, bool) {
a.overflowing_mul(b)
}
# frame-pointer boilerplate elided
mov rax, rsi
mul rdx
mov qword ptr [rdi], rax
seto byte ptr [rdi + 8]
mov rax, rdi # return the pointer to the return-value
ret
Asm output from the Godbolt compiler explorer (Rust 1.7.0). This more or less confirms that the mov
instruction and the extra uop for a one-operand full multiply is more efficient than anything we could do with extra checks after a two-operand imul
.
The documentation for mul says
"The OF and CF flags are set to 0 if the upper half of the result is 0; otherwise, they are set to 1."
So in summary, use mul
and check OF
or CF
to see if the high half is non-zero.
mul
vs. imul
trivia:Only the upper half of the full-multiply (N x N => 2N) result differs between imul
and mul
. I think Intel picked imul
as the one that would have multiple explicit operands so that
imul r32, r32, sign-extended-imm8
would make more sense, because sign-extension is probably more useful than zero-extension.
I only just realised that the flag results from imul
were signed-only, though. Interesting point.
why does gcc not use
mul
for unsigned multiplication?
Because one-operand mul
/imul
are slower (2 uops instead of 1 on Intel CPUs, according to Agner Fog's insn tables. See also the x86 tag wiki). They also use more registers: They require one of their inputs in rax
, and produce their outputs in rdx:rax
, so extra mov
instructions are usually required to move data in/out of those regs.
Thus, imul r64, r64
is a better choice than mul r64
, if you don't care about the flag result.
On Intel CPUs imul r64,r64
is actually faster than mul r32
. That's not the case on some other CPUs, including AMD Bulldozer-family where 64bit multiplies are somewhat slower. But since mul r32
puts its results into edx:eax
instead of just one destination register, they're not direct replacements for each other in most cases anyway.
When multiplying two values, the least significant bits of the result are exactly the same, whether you do unsigned or signed multiplication. So, if you multiply two 32-bit values, you get a 64-bit result, the low 32-bits of which are the same, whether the multiplication is signed or unsigned. Same thing for a 64-bit multiplication, which produces a 128-bit result, the lower 64-bits of which are identical in both cases.
As such, compilers often use the IMUL
instruction (whose mnemonic suggests signed multiplication) for both types of multiplication because it is more flexible than MUL
, and generally faster. Whereas MUL
comes in only one form (allowing an arbitrary general-purpose register or memory location to be multiplied by the implied destination register AL/AX/EAX/RAX), IMUL
has many forms, including a one-operand form (same as MUL
), a two-operand form (register or memory × register or memory or immediate), and a three-operand form (register or memory × immediate, storing the result in a third destination register). More details are available in Intel's documentation (see the x86 tag wiki for links), or quick reference for MUL and IMUL.
The reason the compiler can get away with using IMUL
all the time is because you throw away the high-order bits of the result. When you do a 32-bit × 32-bit multiplication and store the result in a 32-bit variable, the upper 32-bits of the entire 64-bit result were discarded. Again, same for a 64-bit × 64-bit multiplication, which discards the upper 64-bits of the 128-bit result, leaving only the lower 64-bits, which are the same whether it is a signed or unsigned multiply.
Quoting from the Intel manual:
The two- and three-operand forms [of IMUL] may also be used with unsigned operands because the lower half of the product is the same regardless if the operands are signed or unsigned. The CF and OF flags, however, cannot be used to determine if the upper half of the result is non-zero.
Peter Cordes has also explained this very well in a section of his larger answer to a very general question on two's-complement arithmetic operations.
Anyway, when writing the assembly code yourself, you have to decide whether you want to do the same thing the compiler does and throw away the upper bits of the products, or whether you want to keep them. If you don't care about the upper bits and assume that the operation will not overflow, write the same code as the compiler does.
If you do care about the upper bits, just use the MUL
instruction, which sets the CF and OF flags if the product of the multiplication is larger than can fit into the type of its operands.
mov rax, QWORD PTR [a] ; put 64-bit operand 'a' into RAX
mov rbx, QWORD PTR [b] ; put 64-bit operand 'b' into RBX
mul rbx ; multiply 'a' * 'b'
; 128-bit result is returned in RDX:RAX (upper-bits:lower-bits)
jo ProductOverflowed
Using MUL
here is almost certainly more efficient than trying to find a way to use IMUL
and testing the high 64-bits afterwards to see if they are non-zero (which would indicate an overflow). Simply having a non-predictable branch would put you way behind in performance, compared to the 1 or 2 μops you would stand to save with IMUL
.