In C++, say that:
uint64_t i;
uint64_t j;
then i * j will yield a uint64_t whose value is the lower 64 bits of the product of i and j, i.e. (i * j) mod 2^64. Now, what if I wanted the higher 64 bits of the product, i.e. something like uint64_t k = mulhi(i, j)?
If you're using gcc and your version supports 128-bit numbers (try __uint128_t), then performing the full 128-bit multiply and extracting the upper 64 bits is likely to be the most efficient way of getting the result.
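For example, a minimal sketch (mulhi_u64 is my own name; it assumes __uint128_t support, as in gcc/clang on 64-bit targets):
#include <stdint.h>
uint64_t mulhi_u64(uint64_t a, uint64_t b)
{
    // Do the multiply in 128 bits, then keep the top 64 bits of the product.
    return (uint64_t)(((__uint128_t)a * b) >> 64);
}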
If your compiler doesn't support 128-bit numbers, then Yakk's answer is correct. However, it may be too brief for general consumption. In particular, an actual implementation has to be careful not to overflow 64-bit integers.
The simple and portable solution he proposes is to break each of a and b into two 32-bit halves and then multiply those halves using the 64-bit multiply operation. If we write:
uint64_t a_lo = (uint32_t)a;
uint64_t a_hi = a >> 32;
uint64_t b_lo = (uint32_t)b;
uint64_t b_hi = b >> 32;
then it is obvious that:
a = (a_hi << 32) + a_lo;
b = (b_hi << 32) + b_lo;
and:
a * b = ((a_hi << 32) + a_lo) * ((b_hi << 32) + b_lo)
= ((a_hi * b_hi) << 64) +
((a_hi * b_lo) << 32) +
((b_hi * a_lo) << 32) +
a_lo * b_lo
provided the calculation is performed using 128 bit (or greater) arithmetic.
But this problem requires that we perform all the calculations using 64-bit arithmetic, so we have to worry about overflow.
Since a_hi, a_lo, b_hi, and b_lo are all unsigned 32 bit numbers, their product will fit in an unsigned 64 bit number without overflow. However, the intermediate results of the above calculation will not.
The following code implements mulhi(a, b) when the mathematics must be performed modulo 2^64:
uint64_t mulhi(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a;
    uint64_t a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b;
    uint64_t b_hi = b >> 32;

    uint64_t a_x_b_hi  = a_hi * b_hi;
    uint64_t a_x_b_mid = a_hi * b_lo;
    uint64_t b_x_a_mid = b_hi * a_lo;
    uint64_t a_x_b_lo  = a_lo * b_lo;

    uint64_t carry_bit = ((uint64_t)(uint32_t)a_x_b_mid +
                          (uint64_t)(uint32_t)b_x_a_mid +
                          (a_x_b_lo >> 32)) >> 32;

    uint64_t multhi = a_x_b_hi +
                      (a_x_b_mid >> 32) + (b_x_a_mid >> 32) +
                      carry_bit;

    return multhi;
}
As Yakk points out, if you don't mind being off by +1 in the upper 64 bits, you can omit the calculation of the carry bit.
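A hedged sketch of that variant (mulhi_approx is my own name for it; dropping the carry means the result can come out slightly below the exact high half):
uint64_t mulhi_approx(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a;
    uint64_t a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b;
    uint64_t b_hi = b >> 32;

    // Same cross products as above, but the carry out of the low 64 bits
    // of the full product is ignored.
    return a_hi * b_hi + ((a_hi * b_lo) >> 32) + ((b_hi * a_lo) >> 32);
}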
Here's the inline asm for an ARMv8 / AArch64 version:
// High (p1) and low (p0) product
uint64_t p0, p1;
// multiplicand and multiplier
uint64_t a = ..., b = ...;
p0 = a*b; asm ("umulh %0,%1,%2" : "=r"(p1) : "r"(a), "r"(b));
And here's the asm for old DEC compilers:
p0 = a*b; p1 = asm("umulh %a0, %a1, %v0", a, b);
If you have x86's BMI2 and would like to use mulxq:
asm ("mulxq %3, %0, %1" : "=r"(p0), "=r"(p1) : "d"(a), "r"(b));
And the generic x86 multiply using mulq:
asm ("mulq %3" : "=a"(p0), "=d"(p1) : "a"(a), "g"(b) : "cc");
This is a unit-tested version that I came up with tonight that provides the full 128-bit product. On inspection it seems to be simpler than most of the other solutions online (e.g. in the Botan library and other answers here) because it takes advantage of how the MIDDLE PART does not overflow, as explained in the code comments.
For context I wrote it for this github project: https://github.com/catid/fp61
//------------------------------------------------------------------------------
// Portability Macros
// Compiler-specific force inline keyword
#ifdef _MSC_VER
# define FP61_FORCE_INLINE inline __forceinline
#else
# define FP61_FORCE_INLINE inline __attribute__((always_inline))
#endif
//------------------------------------------------------------------------------
// Portable 64x64->128 Multiply
// CAT_MUL128: r{hi,lo} = x * y
// Returns low part of product, and high part is set in r_hi
FP61_FORCE_INLINE uint64_t Emulate64x64to128(
uint64_t& r_hi,
const uint64_t x,
const uint64_t y)
{
const uint64_t x0 = (uint32_t)x, x1 = x >> 32;
const uint64_t y0 = (uint32_t)y, y1 = y >> 32;
const uint64_t p11 = x1 * y1, p01 = x0 * y1;
const uint64_t p10 = x1 * y0, p00 = x0 * y0;
/*
This is implementing schoolbook multiplication:
x1 x0
X y1 y0
-------------
00 LOW PART
-------------
00
10 10 MIDDLE PART
+ 01
-------------
01
+ 11 11 HIGH PART
-------------
*/
// 64-bit product + two 32-bit values
const uint64_t middle = p10 + (p00 >> 32) + (uint32_t)p01;
/*
Proof that 64-bit products can accumulate two more 32-bit values
without overflowing:
Max 32-bit value is 2^32 - 1.
PSum = (2^32-1) * (2^32-1) + (2^32-1) + (2^32-1)
= 2^64 - 2^32 - 2^32 + 1 + 2^32 - 1 + 2^32 - 1
= 2^64 - 1
Therefore it cannot overflow regardless of input.
*/
// 64-bit product + two 32-bit values
r_hi = p11 + (middle >> 32) + (p01 >> 32);
// Add LOW PART and lower half of MIDDLE PART
return (middle << 32) | (uint32_t)p00;
}
#if defined(_MSC_VER) && defined(_WIN64)
// Visual Studio 64-bit
# include <intrin.h>
# pragma intrinsic(_umul128)
# define CAT_MUL128(r_hi, r_lo, x, y) \
r_lo = _umul128(x, y, &(r_hi));
#elif defined(__SIZEOF_INT128__)
// Compiler supporting 128-bit values (GCC/Clang)
# define CAT_MUL128(r_hi, r_lo, x, y) \
{ \
unsigned __int128 w = (unsigned __int128)x * y; \
r_lo = (uint64_t)w; \
r_hi = (uint64_t)(w >> 64); \
}
#else
// Emulate 64x64->128-bit multiply with 64x64->64 operations
# define CAT_MUL128(r_hi, r_lo, x, y) \
r_lo = Emulate64x64to128(r_hi, x, y);
#endif // End CAT_MUL128
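For reference, a quick sanity check of the macro (the expected values are hand-worked by me, not taken from the fp61 project):
#include <assert.h>
#include <stdint.h>
int main()
{
    uint64_t hi, lo;
    // (2^64 - 1)^2 = 2^128 - 2^65 + 1, so the high half is 2^64 - 2 and the low half is 1.
    CAT_MUL128(hi, lo, UINT64_MAX, UINT64_MAX);
    assert(hi == UINT64_MAX - 1 && lo == 1);
    return 0;
}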
Long multiplication should be OK for performance.

Split a*b into (hia+loa)*(hib+lob). This gives four 32-bit multiplies plus some shifts. Do them in 64 bits, handle the carries manually, and you'll get the high portion.

Note that an approximation of the high portion can be done with fewer multiplies -- accurate to within 2^33 or so with one multiply, and within 1 with three multiplies.
I do not think there is a portable alternative.
TL:DR with GCC for a 64-bit ISA: (a * (unsigned __int128)b) >> 64 compiles nicely, to a single full-multiply or high-half multiply instruction. No need to mess around with inline asm.
Unfortunately current compilers don't optimize @craigster0's nice portable version, so if you want to take advantage of 64-bit CPUs, you can't use it except as a fallback for targets you don't have an #ifdef for. (I don't see a generic way to optimize it; you need a 128-bit type or an intrinsic.)
GNU C (gcc, clang, or ICC) has unsigned __int128 on most 64-bit platforms (or, in older versions, __uint128_t). GCC doesn't implement this type on 32-bit platforms, though.
This is an easy and efficient way to get the compiler to emit a 64-bit full-multiply instruction and keep the high half. (GCC knows that a uint64_t cast to a 128-bit integer still has the upper half all zero, so you don't get a 128-bit multiply using three 64-bit multiplies.)
MSVC also has a __umulh intrinsic for 64-bit high-half multiplication, but again it's only available on 64-bit platforms (specifically x86-64 and AArch64). The docs also mention IPF (IA-64) having _umul128 available, but I don't have MSVC for Itanium to check. (Probably not relevant anyway.)
#include <stdint.h>

#define HAVE_FAST_mul64 1
#ifdef __SIZEOF_INT128__ // GNU C
static inline
uint64_t mulhi64(uint64_t a, uint64_t b) {
unsigned __int128 prod = a * (unsigned __int128)b;
return prod >> 64;
}
#elif defined(_M_X64) || defined(_M_ARM64) // MSVC
// MSVC for x86-64 or AArch64
// possibly also || defined(_M_IA64) || defined(_WIN64)
// but the docs only guarantee x86-64! Don't use *just* _WIN64; it doesn't include AArch64 Android / Linux
// https://docs.microsoft.com/en-gb/cpp/intrinsics/umulh
#include <intrin.h>
#define mulhi64 __umulh
#elif defined(_M_IA64) // || defined(_M_ARM) // MSVC again
// https://docs.microsoft.com/en-gb/cpp/intrinsics/umul128
// incorrectly say that _umul128 is available for ARM
// which would be weird because there's no single insn on AArch32
#include <intrin.h>
static inline
uint64_t mulhi64(uint64_t a, uint64_t b) {
unsigned __int64 HighProduct;
(void)_umul128(a, b, &HighProduct);
return HighProduct;
}
#else
# undef HAVE_FAST_mul64
uint64_t mulhi64(uint64_t a, uint64_t b); // non-inline prototype
// or you might want to define @craigster0's version here so it can inline.
#endif
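As a usage sketch (my own illustration, not from the answer), the classic multiply-shift trick maps a uniform 64-bit hash into [0, n) without a division:
static inline uint64_t map_to_range(uint64_t hash, uint64_t n)
{
    // High half of hash * n is in [0, n) when hash spans the full 64-bit range.
    return mulhi64(hash, n);
}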
For x86-64, AArch64, and PowerPC64 (and others), mulhi64 compiles to one mul instruction, and a couple of movs to deal with the calling convention (which should optimize away after this inlines).
From the Godbolt compiler explorer (with source + asm for x86-64, PowerPC64, and AArch64):
# x86-64 gcc7.3. clang and ICC are the same. (x86-64 System V calling convention)
# MSVC makes basically the same function, but with different regs for x64 __fastcall
mov rax, rsi
mul rdi # RDX:RAX = RAX * RDI
mov rax, rdx
ret
(Or with clang -march=haswell to enable BMI2: mov rdx, rsi / mulx rax, rcx, rdi to put the high half in RAX directly. gcc is dumb and still uses an extra mov.)
For AArch64 (with gcc unsigned __int128 or MSVC with __umulh):
test_var:
umulh x0, x0, x1
ret
With a compile-time-constant power-of-2 multiplier, we usually get the expected right shift to grab a few high bits. But gcc amusingly uses shld (see the Godbolt link).
Unfortunately current compilers don't optimize @craigster0's nice portable version. You get 8x shr r64,32, 4x imul r64,r64, and a bunch of add/mov instructions for x86-64, i.e. it compiles to a lot of 32x32 => 64-bit multiplies and unpacks of the results. So if you want something that takes advantage of 64-bit CPUs, you need some #ifdefs.
A full-multiply mul r64 instruction is 2 uops on Intel CPUs, but still only 3 cycle latency, the same as imul r64,r64 which only produces a 64-bit result. So the __int128 / intrinsic version is 5 to 10 times cheaper in latency and throughput (impact on surrounding code) on modern x86-64 than the portable version, from a quick eyeball guess based on http://agner.org/optimize/.
Check it out on the Godbolt compiler explorer on the above link.
gcc does fully optimize this function when multiplying by 16, though: you get a single right shift, more efficient than with an unsigned __int128 multiply.