问题
How can I tell the MSVC compiler to use the 64bit/32bit division operation to compute the result of the following function for the x86-64 target:
#include <stdint.h>
uint32_t ScaledDiv(uint32_t a, uint32_t b)
{
if (a > b)
return ((uint64_t)b<<32) / a; //Yes, this must be casted because the result of b<<32 is undefined
else
return uint32_t(-1);
}
I would like the code, when the if
statement is true, to compile to use the 64bit/32bit division operation, e.g. something like this:
; Assume arguments on entry are: Dividend in EDX, Divisor in ECX
mov edx, edx ;A dummy instruction to indicate that the dividend is already where it is supposed to be
xor eax,eax
div ecx ; EAX = EDX:EAX / ECX
...however the x64 MSVC compiler insists on using the 128bit/64bit div
instruction, such as:
mov eax, edx
xor edx, edx
shl rax, 32 ; Scale up the dividend
mov ecx, ecx
div rcx ;RAX = RDX:RAX / RCX
See: https://www.godbolt.org/z/VBK4R71
According to the answer to this question, the 128bit/64bit div
instruction is not faster than the 64bit/32bit div
instruction.
This is a problem because it unnecessarily slows down my DSP algorithm which makes millions of these scaled divisions.
I tested this optimization by patching the executable to use the 64bit/32bit div instruction: The performance increased 28% according to the two timestamps yielded by the rdtsc
instructions.
(Editor's note: presumably on some recent Intel CPU. AMD CPUs don't need this micro-optimization, as explained in the linked Q&A.)
回答1:
No current compilers (gcc/clang/ICC/MSVC) will do this optimization from portable ISO C source, even if you let them prove that b < a
so the quotient will fit in 32 bits. (For example with GNU C if(b>=a) __builtin_unreachable();
on Godbolt). This is a missed optimization; until that's fixed, you have to work around it with intrinsics or inline asm.
(Or use a GPU or SIMD instead; if you have the same divisor for many elements see https://libdivide.com/ for SIMD to compute a multiplicative inverse once and apply it repeatedly.)
_udiv64 is available starting in Visual Studio 2019 RTM.
In C mode (-TC
) it's apparently always defined. In C++ mode, you need to #include <immintrin.h>
, as per the Microsoft docs. or intrin.h
.
https://godbolt.org/z/vVZ25L (Or on Godbolt.ms because recent MSVC on the main Godbolt site is not working1.)
#include <stdint.h>
#include <immintrin.h> // defines the prototype
// pre-condition: a > b else 64/32-bit division overflows
uint32_t ScaledDiv(uint32_t a, uint32_t b)
{
uint32_t remainder;
uint64_t d = ((uint64_t) b) << 32;
return _udiv64(d, a, &remainder);
}
int main() {
uint32_t c = ScaledDiv(5, 4);
return c;
}
_udiv64 will produce 64/32 div. The two shifts left and right are a missed optimization.
;; MSVC 19.20 -O2 -TC
a$ = 8
b$ = 16
ScaledDiv PROC ; COMDAT
mov edx, edx
shl rdx, 32 ; 00000020H
mov rax, rdx
shr rdx, 32 ; 00000020H
div ecx
ret 0
ScaledDiv ENDP
main PROC ; COMDAT
xor eax, eax
mov edx, 4
mov ecx, 5
div ecx
ret 0
main ENDP
So we can see that MSVC doesn't do constant-propagation through _udiv64
, even though in this case it doesn't overflow and it could have compiled main
to just mov eax, 0ccccccccH
/ ret
.
UPDATE #2 https://godbolt.org/z/n3Dyp- Added a solution with Intel C++ Compiler, but this is less efficient and will defeat constant-propagation because it's inline asm.
#include <stdio.h>
#include <stdint.h>
__declspec(regcall, naked) uint32_t ScaledDiv(uint32_t a, uint32_t b)
{
__asm mov edx, eax
__asm xor eax, eax
__asm div ecx
__asm ret
// implicit return of EAX is supported by MSVC, and hopefully ICC
// even when inlining + optimizing
}
int main()
{
uint32_t a = 3 , b = 4, c = ScaledDiv(a, b);
printf( "(%u << 32) / %u = %u\n", a, b, c);
uint32_t d = ((uint64_t)a << 32) / b;
printf( "(%u << 32) / %u = %u\n", a, b, d);
return c != d;
}
Footnote 1: Matt Godbolt's main site's non-WINE MSVC compilers are temporarily(?) gone. Microsoft runs https://www.godbolt.ms/ to host the recent MSVC compilers on real Windows, and normally the main Godbolt.org site relayed to that for MSVC.)
It seems godbolt.ms will generate short links, but not expand them again! Full links are better anyway for their resistance to link-rot.
回答2:
@Alex Lopatin's answer shows how to use _udiv64
to get non-terrible scalar code (despite MSVC's stupid missed optimization shifting left/right).
For compilers that support GNU C inline asm (including ICC), you can use that instead of the inefficient MSVC inline asm syntax that has a lot of overhead for wrapping a single instruction. See What is the difference between 'asm', '__asm' and '__asm__'? for an example wrapping 64-bit / 32-bit => 32-bit idiv
. (Use it for div
by just changing the mnemonic and the types to unsigned.) GNU C doesn't have an intrinsic for 64 / 32 or 128 / 64 division; it's supposed to optimize pure C. But unfortunately GCC / Clang / ICC have missed optimizations for this case even using if(a<=b) __builtin_unreachable();
to promise that a>b
.
But that's still scalar division, with pretty poor throughput.
Perhaps you can a GPU for your DSP task? If you have a large enough batch of work (and the rest of your algorithm is GPU-friendly) then it's probably worth the overhead of the communication round trip to the GPU.
If you're using the CPU, then anything we can suggest will benefit from parallelizing over multiple cores, so do that for more throughput.
x86 SIMD (SSE4/AVX2/AVX512*) doesn't have SIMD integer division in hardware. The Intel SVML functions _mm_div_epu64 and _mm256_div_epu64 are not intrinsics for a real instruction, they're slow functions that maybe unpack to scalar or compute multiplicative inverses. Or whatever other trick they use; possibly the 32-bit division functions convert to SIMD vectors of double
, especially if AVX512 is available. (Intel still calls them "intrinsics" maybe because they're like built-in function that it understands and can do constant-propagation through. They're probably as efficient as they can be, but that's "not very", and they need to handle the general case, not just your special case with the low half of one divisor being all zero and the quotient fitting in 32 bits.)
If you have the same divisor for many elements, see https://libdivide.com/ for SIMD to compute a multiplicative inverse once and apply it repeatedly. (You should adapt that technique to bake in the shifting of the dividend without actually doing it, leaving the all-zero low half implicit.)
If your divisor is always varying, and this isn't a middle step in some larger SIMD-friendly algorithm, scalar division may well be your best bet if you need exact results.
You could get big speedups from using SIMD float
if 24-bit mantissa precision is sufficient
uint32_t ScaledDiv(uint32_t a, uint32_t b)
{
return ((1ULL<<32) * (float)b) / a;
}
(float)(1ULL<<32)
is a compile-time constant 4294967296.0f
.
This does auto-vectorize over an array, with gcc and clang even without -ffast-math
(but not MSVC). See it on Godbolt. You could port gcc or clang's asm back to intrinsics for MSVC; they use some FP tricks for packed-conversion of unsigned integers to/from float without AVX512. Non-vectorized scalar FP will probably be slower than plain integer on MSVC, as well as less accurate.
For example, Skylake's div r32
throughput is 1 per 6 cycles. But its AVX vdivps ymm
throughput is one instruction (of 8 float
s) per 5 cycles. Or for 128-bit SSE2, divps xmm
has one per 3 cycle throughput. So you get about 10x the division throughput from AVX on Skylake. (8 * 6/5 = 9.6) Older microarchitectures have much slower SIMD FP division, but also somewhat slower integer division. In general the ratio is smaller because older CPUs don't have as wide SIMD dividers, so 256-bit vdivps
has to run the 128-bit halves through separately. But there's still plenty of gain to be had, like better than a factor of 4 on Haswell. And Ryzen has vdivps ymm
throughput of 6c, but div 32
throughput of 14-30 cycles. So that's an even bigger speedup than Skylake.
If the rest of your DSP task can benefit from SIMD, the overall speedup should be very good. float
operations have higher latency, so out-of-order execution has to work harder to hide that latency and overlap execution of independent loop iterations. So IDK whether it would be better for you to just convert to float and back for this one operation, or to change your algorithm to work with float
everywhere. It depends what else you need to do with your numbers.
If your unsigned numbers actually fit into signed 32-bit integers, you can use direct hardware support for packed SIMD int32 -> float conversion. Otherwise you need AVX512F for packed uint32
-> float with a single instruction, but that can be emulated with some loss of efficiency. That's what gcc/clang do when auto-vectorizing with AVX2, and why MSVC doesn't auto-vectorize.
MSVC does auto-vectorize with int32_t
instead of uint32_t
(and gcc/clang can make more efficient code), so prefer that if the highest bit of your integer inputs and/or outputs can't be set. (i.e. the 2's complement interpretation of their bit-patterns will be non-negative.)
With AVX especially, vdivps
is slow enough to mostly hide the throughput costs of converting from integer and back, unless there's other useful work that could have overlapped instead.
Floating point precision:
A float
stores numbers as significand * 2^exp
where the significand is in the range [1.0, 2.0)
. (Or [0, 1.0)
for subnormals). A single-precision float
has 24-bits of significand precision, including the 1 implicit bit.
https://en.wikipedia.org/wiki/Single-precision_floating-point_format
So the 24 most-significant digits of an integer can be represented, the rest lost to rounding error. An integer like (uint64_t)b << 32
is no problem for float
; that just means a larger exponent. The low bits are all zero.
For example, b = 123105810
gives us 528735427897589760
for b64 << 32
. Converting that to float
directly from 64-bit integer gives us 528735419307655168
, a rounding error of 0.0000016%, or about 2^-25.8. That's unsurprising: the max rounding error is 0.5ulp (units in the last place), or 2^-25, and this number was even so it had 1 trailing zero anyway. That's the same relative error we'd get from converting 123105810
; the resulting float
is also the same except for its exponent field (which is higher by 32).
(I used https://www.h-schmidt.net/FloatConverter/IEEE754.html to check this.)
float
's max exponent is large enough to hold integers outside the INT64_MIN
to INT64_MAX
range. The low bits of the large integers that float
can represent are all zero, but that's exactly what you have with b<<32
. So you're only losing the low 9 bits of b
in the worst case where it's full-range and odd.
If the important part of your result is the most-significant bits, and having the low ~9 integer bits = rounding error is ok after converting back to integer, then float
is perfect for you.
If float
doesn't work, double
may be an option.
divpd
is about twice as slow as divps
on many CPUs, and only does half as much work (2 double
elements instead of 4 float
). So you lose a factor of 4 throughput this way.
But every 32-bit integer can be represented exactly as a double
. And by converting back with truncation towards zero, I think you get exact integer division for all pairs of inputs, unless double-rounding is a problem (first to nearest double, then truncation). You can test it with
// exactly correct for most inputs at least, maybe all.
uint32_t quotient = ((1ULL<<32) * (double)b) / a;
The unsigned long long constant (1ULL<<32)
is converted to double
, so you have 2x u32 -> double conversions (of a
and b
), a double multiply, a double divide, and a double -> u32 conversion. x86-64 can do all of these efficiently with scalar conversions (by zero extending uint32_t
into int64_t
, or ignoring the high bits of a double->int64_t conversion), but it will probably still be slower than div r32
.
Converting u32 -> double and back (without AVX512) is maybe even more expensive that converting u32 -> float, but clang does auto-vectorize it.
(Just change float
to double
in the godbolt link above). Again it would help a lot if your inputs were all <= INT32_MAX
so they could be treated as signed integers for FP conversion.
If double-rounding is a problem, you could maybe set the FP rounding mode to truncation instead of the default round-to-nearest, if you don't use FP for anything else in the thread where your DSP code is running.
来源:https://stackoverflow.com/questions/56657236/how-can-i-instruct-the-msvc-compiler-to-use-a-64bit-32bit-division-instead-of-th