Question
I have some places in my code where I want to ensure that a division of two arbitrary floating-point numbers (32-bit single precision) won't overflow. The target/compiler does not guarantee (explicitly enough) nice handling of -INF/INF, does not fully guarantee IEEE 754 behavior for the exceptional values (it is possibly undefined), and the target might change. Also, I cannot make safe assumptions about the inputs in these few special places, and I am bound to the C90 standard libraries.
I have read What Every Computer Scientist Should Know About Floating-Point Arithmetic but to be honest, I am a little bit lost.
So I want to ask the community whether the following piece of code would do the trick, and whether there are better, faster, more exact, or more correct ways to do it:
#define SIGN_F(val) (((val) >= 0.0f) ? 1.0f : -1.0f)
float32_t safedivf(float32_t num, float32_t denum)
{
    const float32_t abs_denum = fabs(denum);
    if ((abs_denum < 1.0f) && ((abs_denum * FLT_MAX) <= (float32_t)fabs(num)))
        return SIGN_F(denum) * SIGN_F(num) * FLT_MAX;
    else
        return num / denum;
}
Edit: Changed ((abs_denum * FLT_MAX) < (float32_t)fabs(num)) to ((abs_denum * FLT_MAX) <= (float32_t)fabs(num)), as recommended by Pascal Cuoq.
Answer 1:
In (abs_denum * FLT_MAX) < (float32_t)fabs(num), the product abs_denum * FLT_MAX may round down and end up equal to fabs(num). This does not mean that num / denum is not mathematically larger than FLT_MAX, and you should be worried that it might happen to cause the overflow that you want to avoid. You had better replace this < by <=.
For an alternative solution, if a double type is available and is wider than float, it may be more economical to compute (double)num/(double)denum. If float is binary32ish and double is binary64ish, the only way the double division can overflow is if denum is (a) zero (and if denum is a zero, your code is also problematic).
double dbl_res = (double)num/(double)denum;
float res = dbl_res < -FLT_MAX ? -FLT_MAX : dbl_res > FLT_MAX ? FLT_MAX : (float)dbl_res;
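As a rough sketch, the two lines above might be wrapped into a C90-compatible function like this. The name safedivf_via_double and the handling of a zero denominator (saturating by the sign of num, with 0/0 mapped to 0.0f) are assumptions made for illustration and are not part of the answer. It also assumes that double is wider than float, as the answer requires.

```c
#include <float.h>

/* Hypothetical wrapper; assumes double is wider than float
   (for example binary64 vs. binary32). */
float safedivf_via_double(float num, float denum)
{
    double dbl_res;

    /* Assumption: the caller prefers saturation over a division by zero.
       The sign of a zero denominator is ignored in this sketch. */
    if (denum == 0.0f) {
        if (num > 0.0f) return FLT_MAX;
        if (num < 0.0f) return -FLT_MAX;
        return 0.0f; /* 0/0 (or NaN/0): arbitrary choice for this sketch */
    }

    /* With finite float inputs and a non-zero denominator,
       this division cannot overflow a binary64 double. */
    dbl_res = (double)num / (double)denum;

    /* Clamp to the float range before narrowing. */
    if (dbl_res > FLT_MAX) return FLT_MAX;
    if (dbl_res < -FLT_MAX) return -FLT_MAX;
    return (float)dbl_res;
}
```

NaN and infinite inputs are not treated specially here; a real implementation would have to decide what to return for them.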
Answer 2:
You may try to extract the exponents and the mantissas of num and denum, and check whether the following condition holds:
((exp(num) - exp(denum)) > max_exp) && (mantissa(num) >= mantissa(denum))
If it does, generate the corresponding INF according to the signs of the inputs.
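A minimal sketch of that exponent test, assuming frexp() from the C90 <math.h> is used to split the values (it returns a mantissa with magnitude in [0.5, 1)) and that max_exp corresponds to FLT_MAX_EXP. The function name is hypothetical. Under these assumptions the condition is sufficient for overflow but not necessary: some overflowing quotients are not flagged, so it would not replace a final range check.

```c
#include <float.h>
#include <math.h>

/* Hypothetical helper: returns 1 when the condition above holds, i.e. when
   num / denum certainly overflows a float; 0 means "not decided by this test". */
int division_surely_overflows(float num, float denum)
{
    int exp_num, exp_denum;
    double man_num, man_denum;

    if (num == 0.0f || denum == 0.0f)
        return 0; /* zeros are outside the scope of this sketch */

    /* frexp() yields x == man * 2^exp with 0.5 <= |man| < 1 */
    man_num = fabs(frexp((double)num, &exp_num));
    man_denum = fabs(frexp((double)denum, &exp_denum));

    return (exp_num - exp_denum > FLT_MAX_EXP) && (man_num >= man_denum);
}
```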
Answer 3:
Work carefully with num and denom when the quotient is near FLT_MAX.
The following uses tests inspired by the OP's code but stays away from results near FLT_MAX because, as @Pascal Cuoq points out, rounding may just push the result over the edge. Instead, it uses thresholds of FLT_MAX/FLT_RADIX and FLT_MAX*FLT_RADIX.
By scaling with FLT_RADIX, typically 2, the code should always get exact results. Rounding under any rounding mode is not expected to affect the result.
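To illustrate why scaling by FLT_RADIX is attractive, here is a small sketch (the helper name is made up): dividing a float by the radix and multiplying back should reproduce the original value exactly, as long as the intermediate result does not underflow into the subnormal range.

```c
#include <float.h>

/* Hypothetical check: returns 1 if scaling x down and back up by
   FLT_RADIX is lossless. volatile forces the intermediates to float. */
int radix_scaling_is_exact(float x)
{
    volatile float scaled = x / FLT_RADIX;        /* only the exponent changes */
    volatile float restored = scaled * FLT_RADIX; /* undo the scaling */
    return restored == x;
}
```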
In terms of speed, the "happy path", that is, when the result certainly does not overflow, should be a speedy calculation. This still needs unit testing, but the comments should provide the gist of the approach.
static int SD_Sign(float x) {
    if (x > 0.0f)
        return 1;
    if (x < 0.0f)
        return -1;
    if (atan2f(x, -1.0f) > 0.0f)
        return 1;
    return -1;
}
static float SD_Overflow(float num, float denom) {
    return SD_Sign(num) * SD_Sign(denom) * FLT_MAX;
}
float safedivf(float num, float denom) {
    float abs_denom = fabsf(denom);
    // If |quotient| > |num|
    if (abs_denom < 1.0f) {
        float abs_num = fabsf(num);
        // If |num/denom| > FLT_MAX/2 --> quotient is very large or overflows
        // This computation is safe from rounding and overflow.
        if (abs_num > FLT_MAX / FLT_RADIX * abs_denom) {
            // If |num/denom| >= FLT_MAX*2 --> overflow
            // This also catches denom == 0.0
            if (abs_num / FLT_RADIX >= FLT_MAX * abs_denom) {
                return SD_Overflow(num, denom);
            }
            // At this point, quotient must be in or near range FLT_MAX/2 to FLT_MAX*2
            // Scale parameters so quotient is a FLT_RADIX * FLT_RADIX factor smaller.
            if (abs_num > 1.0) {
                abs_num /= FLT_RADIX * FLT_RADIX;
            } else {
                abs_denom *= FLT_RADIX * FLT_RADIX;
            }
            float quotient = abs_num / abs_denom;
            if (quotient > FLT_MAX / (FLT_RADIX * FLT_RADIX)) {
                return SD_Overflow(num, denom);
            }
        }
    }
    return num / denom;
}
The SIGN_F() macro needs to consider whether denum is +0.0 or -0.0. Various methods were mentioned by @Pascal Cuoq in a comment:
- copysign() or signbit()
- Use a union
Additionally, some functions, when well implemented, differentiate between +/- zero, such as atan2f(zero, -1.0) and sprintf(buffer, "%+f", zero).
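For the union approach, a sketch along these lines might work on a C90 target that lacks copysign() and signbit(). It assumes that float and unsigned int are both 32 bits wide with the same byte order and that the sign is the most significant bit (as in IEEE 754 binary32). The function name is hypothetical, and reading a union member other than the one last written is implementation-defined in C90.

```c
/* Hypothetical helper: returns -1.0f if the sign bit of x is set, else 1.0f.
   Assumes a 32-bit float and a 32-bit unsigned int sharing one byte order. */
float sign_of_float(float x)
{
    union {
        float f;
        unsigned int u;
    } pun;

    pun.f = x;
    /* The top bit of a binary32 value is its sign, even for +/-0.0. */
    return (pun.u & 0x80000000u) ? -1.0f : 1.0f;
}
```

Unlike a comparison against 0.0f, this distinguishes -0.0f from +0.0f, which is what SIGN_F() would need.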
Note: Used float vs. float32_t for simplicity.
Note: Maybe use fabsf() rather than fabs().
Minor: Suggest denom (denominator) in lieu of denum.
Answer 4:
To avoid the corner cases with rounding and whatnot, you could massage the exponent of the divisor, with frexp() and ldexp(), and worry about whether the result can be scaled back without overflow. Or frexp() both arguments and do the exponent work by hand.
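One possible reading of this suggestion, sketched with the C90 double versions of frexp() and ldexp(): split both arguments, divide the mantissas (which cannot overflow), and check the combined exponent before scaling back. The function name, the saturation to +/-FLT_MAX, and the treatment of zero operands are assumptions for illustration only; NaN and infinite inputs are not handled.

```c
#include <float.h>
#include <math.h>

/* Hypothetical sketch; saturates to +/-FLT_MAX instead of overflowing. */
float safedivf_frexp(float num, float denom)
{
    int e_num, e_den, e_quot;
    double m_num, m_den, m_quot, quot;

    if (num == 0.0f)
        return 0.0f; /* also covers 0/0: arbitrary choice for this sketch */
    if (denom == 0.0f)
        return (num > 0.0f) ? FLT_MAX : -FLT_MAX; /* sign of zero ignored */

    /* x == m * 2^e with 0.5 <= |m| < 1 */
    m_num = frexp((double)num, &e_num);
    m_den = frexp((double)denom, &e_den);

    /* |m_num / m_den| lies in (0.5, 2), so this division is harmless. */
    m_quot = frexp(m_num / m_den, &e_quot);
    e_quot += e_num - e_den;

    /* |quotient| >= 2^FLT_MAX_EXP: certainly above FLT_MAX. */
    if (e_quot > FLT_MAX_EXP)
        return (m_quot > 0.0) ? FLT_MAX : -FLT_MAX;

    /* Safe to scale back; clamp in case the value sits just above FLT_MAX. */
    quot = ldexp(m_quot, e_quot);
    if (quot > FLT_MAX) return FLT_MAX;
    if (quot < -FLT_MAX) return -FLT_MAX;
    return (float)quot;
}
```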
Source: https://stackoverflow.com/questions/25310051/safe-floating-point-division