Question
My program frequently requires the following calculation to be performed:
Given:
- N is a 32-bit integer
- D is a 32-bit integer
- abs(N) <= abs(D)
- D != 0
- X is a 32-bit integer of any value
Find:
- X * N / D as a rounded integer that is X scaled to N/D (e.g. 10 * 2 / 3 = 7)
Obviously I could just use r=x*n/d directly, but I will often get overflow from the x*n. If I instead do r=x*(n/d), then I only get 0 or x due to integer division dropping the fractional component. And then there's r=x*(float(n)/d), but I can't use floats in this case.
Accuracy would be great but isn't as critical as speed and being a deterministic function (always returning the same value given the same inputs).
N and D are currently signed but I could work around them being always unsigned if it helps.
A generic function that works with any value of X (and N and D, as long as N <= D) is ideal since this operation is used in various different ways but I also have a specific case where the value of X is a known constant power of 2 (2048, to be precise), and just getting that specific call sped up would be a big help.
Currently I am accomplishing this using 64-bit multiply and divide to avoid overflow (essentially int multByProperFraction(int x, int n, int d) { return (__int64)x * n / d; } but with some asserts and extra bit fiddling for rounding instead of truncating).
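For reference, the 64-bit baseline described above might look roughly like the sketch below. The original asserts and rounding code are not shown, so the rounding rule (round half away from zero) and the name mult_by_proper_fraction_ref are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference version: 64-bit multiply and divide,
   rounding half away from zero (an assumed rounding rule). */
int32_t mult_by_proper_fraction_ref(int32_t x, int32_t n, int32_t d)
{
    assert(d != 0);
    int64_t num = (int64_t)x * n;   /* the product always fits in 64 bits */
    int64_t den = d;
    if (den < 0) {                  /* normalise so the divisor is positive */
        num = -num;
        den = -den;
    }
    if (num >= 0)
        return (int32_t)((num + den / 2) / den);
    return (int32_t)(-((-num + den / 2) / den));
}
```

With abs(N) <= abs(D) the result fits in 32 bits, except for the single corner case where X is INT32_MIN and N == -D.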
Unfortunately, my profiler is reporting the 64-bit divide function as taking up way too much CPU (this is a 32-bit application). I've tried to reduce how often I need to do this calculation but am running out of ways around it, so I'm trying to figure out a faster method, if it is even possible. In the specific case where X is a constant 2048, I use a bit shift instead of multiply but that doesn't help much.
Answer 1:
Tolerate imprecision and use only the 16 most significant bits of n, d and x.
Algorithm
while (|n| > 0xffff) n /= 2, sh++
while (|x| > 0xffff) x /= 2, sh++
while (|d| > 0xffff) d /= 2, sh--
r = n*x/d // A 16x16 to 32 multiply followed by a 32/16-bit divide.
shift r by sh (left if sh > 0, right if sh < 0).
When a 64-bit divide is expensive, the pre/post-processing here may be worth it in order to do a 32-bit divide instead, which will certainly be the big chunk of CPU.
If the compiler cannot be coaxed into doing a 32-bit/16-bit divide, then skip the while (|d| > 0xffff) d /= 2, sh-- step and do a 32/32 divide.
Use unsigned math where possible.
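A direct C translation of the algorithm above might look like the following sketch. It assumes unsigned inputs, d != 0 and n <= d (so the final left shift cannot overflow), and it truncates rather than rounds. The name scale16 is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name; approximates x * n / d using only 32-bit arithmetic. */
uint32_t scale16(uint32_t x, uint32_t n, uint32_t d)
{
    assert(d != 0);
    int sh = 0;                                 /* net shift to undo at the end */
    while (n > 0xffff) { n >>= 1; sh++; }       /* keep the top 16 bits of n */
    while (x > 0xffff) { x >>= 1; sh++; }       /* keep the top 16 bits of x */
    while (d > 0xffff) { d >>= 1; sh--; }       /* keep the top 16 bits of d */
    uint32_t r = n * x / d;                     /* 16x16 -> 32 multiply, 32-bit divide */
    return sh >= 0 ? r << sh : r >> -sh;
}
```

Small inputs pass through unshifted and give the exact truncated quotient; large inputs lose their low bits, which is where the imprecision comes from.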
Answer 2:
The basic correct approach to this is simply (uint64_t)x*n/d. That's optimal assuming d is variable and unpredictable. But if d is constant or changes infrequently, you can pre-generate constants such that exact division by d can be performed as a multiplication followed by a bitshift. A good description of the algorithm, which is roughly what GCC uses internally to transform division by a constant into multiplication, is here:
http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html
I'm not sure how easy it is to make it work for a "64/32" division (i.e. dividing the result of (uint64_t)x*n), but you should be able to just break it up into high and low parts if nothing else.
Note that these algorithms are also available as libdivide.
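To make the multiply-and-shift idea concrete, here is a sketch of one well-known variant for unsigned 32-bit operands (the Granlund-Montgomery style scheme with a 32-bit magic number plus an add-and-shift fixup). It only covers a 32-bit numerator, not the "64/32" case discussed above, and the names fast_div_u32, fast_div_init and fast_div are invented for this example; they are not libdivide's API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the precomputed constants of one divisor d. */
typedef struct {
    uint32_t magic;  /* floor(2^32 * (2^l - d) / d) + 1, with l = ceil(log2(d)) */
    int sh1;         /* min(l, 1) */
    int sh2;         /* max(l - 1, 0) */
} fast_div_u32;

/* One-time setup per divisor; the only place a 64-bit divide is needed. */
fast_div_u32 fast_div_init(uint32_t d)
{
    assert(d != 0);
    int l = 0;
    while (((uint64_t)1 << l) < d)      /* smallest l with 2^l >= d */
        l++;
    fast_div_u32 f;
    uint64_t two_l_minus_d = ((uint64_t)1 << l) - d;
    f.magic = (uint32_t)((two_l_minus_d << 32) / d + 1);
    f.sh1 = l < 1 ? l : 1;
    f.sh2 = l > 1 ? l - 1 : 0;
    return f;
}

/* Exact n / d via a 32x32 -> 64 multiply, a subtract, an add and two shifts. */
uint32_t fast_div(uint32_t n, fast_div_u32 f)
{
    uint32_t t = (uint32_t)(((uint64_t)f.magic * n) >> 32);  /* high word */
    return (t + ((n - t) >> f.sh1)) >> f.sh2;
}
```

The setup cost only pays off when the same d is reused for many divisions, which matches the "constant or changes infrequently" condition above.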
Answer 3:
I've now benchmarked several possible solutions, including weird/clever ones from other sources like combining 32-bit div & mod & add or using peasant math, and here are my conclusions:
First, if you are only targeting Windows and using VSC++, just use MulDiv(). It is quite fast (faster than directly using 64-bit variables in my tests) while still being just as accurate and rounding the result for you. I could not find any superior method to do this kind of thing on Windows with VSC++, even taking into account restrictions like unsigned-only and N <= D.
However, in my case having a function with deterministic results even across platforms is even more important than speed. On another platform I was using as a test, the 64-bit divide is much, much slower than the 32-bit one when using the 32-bit libraries, and there is no MulDiv() to use. The 64-bit divide on this platform takes ~26x as long as a 32-bit divide (yet the 64-bit multiply is just as fast as the 32-bit version...).
So if you have a case like mine, I will share the best results I got, which turned out to be just optimizations of chux's answer.
Both of the methods I will share below make use of the following function (though the compiler-specific intrinsics only actually helped in speed with MSVC in Windows):
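// Assumes u32 is a 32-bit unsigned typedef (e.g. of uint32_t). On MSVC,
// _BitScanReverse is declared in <intrin.h> and DWORD comes from <windows.h>.
inline u32 bitsRequired(u32 val)
{
#ifdef _MSC_VER
    DWORD r = 0;
    _BitScanReverse(&r, val | 1);   // index of the highest set bit
    return r+1;
#elif defined(__GNUC__) || defined(__clang__)
    return 32 - __builtin_clz(val | 1);
#else
    int r = 1;
    while (val >>= 1) ++r;
    return r;
#endif
}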
Now, if x is a constant that's 16-bit in size or smaller and you can pre-compute the bits required, I found the best results in speed and accuracy from this function:
u32 multConstByPropFrac(u32 x, u32 nMaxBits, u32 n, u32 d)
{
    //assert(nMaxBits == 32 - bitsRequired(x));
    //assert(n <= d);
    const int bitShift = bitsRequired(n) - nMaxBits;
    if( bitShift > 0 )
    {
        n >>= bitShift;
        d >>= bitShift;
    }
    // Remove the + d/2 part if you don't need rounding
    return (x * n + d/2) / d;
}
On the platform with the slow 64-bit divide, the above function ran ~16.75x as fast as return ((u64)x * n + d/2) / d; and with an average 99.999981% accuracy (comparing difference in return value from expected to range of x, i.e. returning +/-1 from expected when x is 2048 would be 100 - (1/2048 * 100) = 99.95% accurate) when testing it with a million or so randomized inputs where roughly half of them would normally have been an overflow. Worst-case accuracy was 99.951172%.
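As a usage illustration for the constant case from the question (x = 2048), the sketch below repeats the function above with only the portable fallback of bitsRequired so that it compiles on its own. The wrapper scale2048 and the macro X_N_MAX_BITS are hypothetical names. Since 2048 needs 12 bits, nMaxBits is 32 - 12 = 20.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;

/* Portable fallback only; the intrinsic versions are omitted here. */
static u32 bitsRequired(u32 val)
{
    u32 r = 1;
    while (val >>= 1) ++r;
    return r;
}

static u32 multConstByPropFrac(u32 x, u32 nMaxBits, u32 n, u32 d)
{
    assert(n <= d && d != 0);
    const int bitShift = (int)bitsRequired(n) - (int)nMaxBits;
    if (bitShift > 0) {         /* drop low bits so x * n fits in 32 bits */
        n >>= bitShift;
        d >>= bitShift;
    }
    return (x * n + d / 2) / d; /* rounded */
}

/* Precomputed by hand for x = 2048: 32 - bitsRequired(2048) = 20. */
#define X_N_MAX_BITS 20u

/* Hypothetical wrapper for the constant-x call site. */
u32 scale2048(u32 n, u32 d)
{
    return multConstByPropFrac(2048u, X_N_MAX_BITS, n, d);
}
```

For small n and d nothing is shifted and the result is exact; scale2048(1, 2) gives 1024, for example.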
For the general use case, I found the best results from the following (and without needing to restrict N <= D to boot!):
u32 scaleToFraction(u32 x, u32 n, u32 d)
{
    u32 bits = bitsRequired(x);
    int bitShift = bits - 16;
    if( bitShift < 0 ) bitShift = 0;
    int sh = bitShift;
    x >>= bitShift;

    bits = bitsRequired(n);
    bitShift = bits - 16;
    if( bitShift < 0 ) bitShift = 0;
    sh += bitShift;
    n >>= bitShift;

    bits = bitsRequired(d);
    bitShift = bits - 16;
    if( bitShift < 0 ) bitShift = 0;
    sh -= bitShift;
    d >>= bitShift;

    // Remove the + d/2 part if don't need rounding
    u32 r = (x * n + d/2) / d;
    if( sh < 0 )
        r >>= (-sh);
    else //if( sh > 0 )
        r <<= sh;
    return r;
}
On the platform with the slow 64-bit divide, the above function ran ~18.5x as fast as using 64-bit variables and with 99.999426% average and 99.947479% worst-case accuracy.
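The accuracy figures quoted in this answer compare the error against the range of x. A minimal sketch of that measurement, assuming the 64-bit rounded result is taken as the expected value, could look like this. The names scale_ref and accuracy_percent are invented here; floating point is used only for reporting, not in the scaled computation itself.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;

/* Expected value: the slow but exact 64-bit version, with rounding. */
u32 scale_ref(u32 x, u32 n, u32 d)
{
    assert(d != 0);
    return (u32)(((u64)x * n + d / 2) / d);
}

/* Accuracy in percent: 100 - (|got - expected| / x * 100). */
double accuracy_percent(u32 x, u32 got, u32 expected)
{
    assert(x != 0);
    u32 diff = got > expected ? got - expected : expected - got;
    return 100.0 - (double)diff / (double)x * 100.0;
}
```

Being off by one when x is 2048 gives 100 - (1/2048 * 100), roughly 99.95%, which matches the example in the text.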
I was able to get more speed or more accuracy by messing with the shifting, such as trying to not shift all the way down to 16-bit if it wasn't strictly necessary, but any increase in speed came at a high cost in accuracy and vice versa.
None of the other methods I tested came even close to the same speed or accuracy, most being slower than just using the 64-bit method or having huge loss in precision, so not worth going into.
Obviously, no guarantee that anyone else will get similar results on other platforms!
EDIT: Replaced some bit-twiddling hacks with plain code that actually ran faster anyway by letting the compiler do its job.
Source: https://stackoverflow.com/questions/57300788/fast-method-to-multiply-integer-by-proper-fraction-without-floats-or-overflow