I'm working with a client who is using an old version of GCC (3.2.3 to be precise) but wants to upgrade, and one reason that's been given as a stumbling block to upgrading to a n…
The "why" is that some compilers will return floating-point values in a floating-point register. These registers have only one size; on x86, for example, the x87 floating-point registers are 80 bits wide. The result of a function that returns a floating-point value will be placed into this register regardless of whether the type has been declared as float, double, float_t or double_t. If the size of the return value and the size of the floating-point register differ, then at some point an instruction will be required to round down to the desired size.
The same kind of conversion is necessary for integers as well, but for subsequent additions and subtractions there is no overhead, because there are instructions to pick which bytes to involve in the operation. The rules for conversion of integers to a smaller size specify that the most significant bits be tossed away, so the result of downsizing can produce a result that is radically different (e.g. (short)(2147450880) --> -32768), but for some reason that seems to be OK with the programming community.
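A minimal illustration of that integer narrowing (the exact result of an out-of-range signed conversion is implementation-defined, but two's-complement targets typically behave as shown):

    #include <stdio.h>

    int main(void)
    {
        long n = 2147450880L;           /* 0x7FFF8000 */
        short s = (short)n;             /* only the low 16 bits survive: 0x8000 */
        printf("%ld -> %d\n", n, s);    /* typically prints: 2147450880 -> -32768 */
        return 0;
    }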
In a floating-point downsizing, the result is specified to be rounded to the closest representable number. If integers were subject to the same rule, then the above example would instead yield (short)(2147450880) -> +32767. Obviously a little more logic is required to perform such an operation than mere truncation of the upper bits. With floating-point, the exponent and the significand change sizes between float, double and long double, so it is more complicated. Additionally, there are issues of conversion between infinities, NaNs, normalized numbers, and denormalized (subnormal) numbers that need to be taken into account. Hardware can implement these conversions in the same amount of time as an integer addition, but if the conversion needs to be implemented in software, it may take 20 instructions, which can have a noticeable effect on performance. Since the C programming model requires that the same results be generated regardless of whether the floating-point is implemented in hardware or software, the software is obliged to execute these extra instructions in order to comply with the computational model. The float_t and double_t types were designed to expose the most efficient return value type.
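A small example of that rounding behaviour: 2^24 + 1 is exactly representable as a double but not as a float, so the narrowing conversion rounds to the nearest float.

    #include <stdio.h>

    int main(void)
    {
        double d = 16777217.0;   /* 2^24 + 1: representable in double, not in float */
        float  f = (float)d;     /* narrowing rounds to the nearest float: 16777216 */
        printf("%.1f -> %.1f\n", d, f);
        return 0;
    }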
The compiler defines the macro FLT_EVAL_METHOD (in <float.h>), which specifies how much precision is to be used in intermediate computations. With integers, the rule is to do intermediate computations using the highest precision of the operands involved; that approach corresponds to FLT_EVAL_METHOD==0. The original K&R, however, specified that all intermediate floating-point computations be done in double, which corresponds to FLT_EVAL_METHOD==1. With the introduction of the IEEE floating-point standard, it became commonplace on some platforms, notably the Macintosh and x86 Windows, to perform intermediate computations in long double (80 bits on x86), which corresponds to FLT_EVAL_METHOD==2.
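One quick way to see which model a given compiler and target actually chose is to print the macro and the resulting type sizes (just a diagnostic sketch, not part of any particular API):

    #include <stdio.h>
    #include <float.h>   /* FLT_EVAL_METHOD */
    #include <math.h>    /* float_t, double_t */

    int main(void)
    {
        printf("FLT_EVAL_METHOD  = %d\n", (int)FLT_EVAL_METHOD);
        printf("sizeof(float_t)  = %zu\n", sizeof(float_t));
        printf("sizeof(double_t) = %zu\n", sizeof(double_t));
        return 0;
    }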
Regression testing will be affected by the FLT_EVAL_METHOD computational model, so your regression code should take it into account. One way is to test FLT_EVAL_METHOD and have a different branch for each model. A similar method is to test sizeof(float_t) and branch on that. A third method is to use some kind of epsilon to check whether the results are close enough.
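A sketch of the first and third approaches; close_enough and rel_eps are illustrative names, not anything standard:

    #include <math.h>
    #include <float.h>

    /* Epsilon approach: accept results that agree to within a relative tolerance. */
    static int close_enough(double got, double expected, double rel_eps)
    {
        double scale = fmax(fabs(got), fabs(expected));
        return fabs(got - expected) <= rel_eps * fmax(scale, 1.0);
    }

    /* Branching approach: pick reference values per evaluation model. */
    #if FLT_EVAL_METHOD == 2
        /* reference results recorded with long double intermediates */
    #elif FLT_EVAL_METHOD == 1
        /* reference results recorded with double intermediates */
    #else
        /* reference results recorded with no extra intermediate precision */
    #endif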
Unfortunately, some computations make a true/false decision based on a floating-point result, and such decisions cannot be rescued with an epsilon. This occurs in computer graphics, for example, when deciding whether a point is inside or outside a polygon, which determines whether a particular pixel should be filled. If your regression involves one of these, you cannot use the epsilon method and must use different branches depending on the computational model.
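For concreteness, here is a sketch of the kind of predicate in question; left_of is an illustrative name, and whether the extra register precision actually flips the answer depends on the inputs and the platform:

    #include <stdio.h>

    /* Is point (px,py) strictly left of the directed line from (ax,ay) to (bx,by)?
       The answer is the sign of a cross product; for nearly collinear inputs that
       sign can depend on whether the intermediate products are kept in wider
       registers or rounded to double at every step. */
    static int left_of(double ax, double ay, double bx, double by,
                       double px, double py)
    {
        return (bx - ax) * (py - ay) - (by - ay) * (px - ax) > 0.0;
    }

    int main(void)
    {
        /* A nearly collinear case: the yes/no answer is the whole result, so
           there is no residual to compare against an epsilon. */
        printf("%d\n", left_of(0.0, 0.0, 1.0, 3.0, 0.1, 0.3));
        return 0;
    }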
Another way to resolve decision regressions between models is to cast the result explicitly to the desired precision. This works most of the time on many compilers, but some compilers think that they are smarter than you and refuse to do the conversion. This happens when an intermediate result is stored in a register but is used in a subsequent computation: you can cast away precision as much as you want in the intermediate result, but the compiler will do nothing, unless you declare the intermediate result as volatile. That forces the compiler to downsize the intermediate result, store it in a variable of the specified size in memory, and then retrieve it when needed for computation.

The IEEE floating-point standard requires exact (correctly rounded) results for the elementary operations (+ - * /) and square root. I believe that sin(), cos(), exp(), log(), etc. are specified to be within 2 ULP (units in the last place) of the closest numerically-representable result. The long double (80-bit) format was designed to allow those other transcendental functions to be computed to the closest numerically-representable result.
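A sketch of the volatile workaround described above (variable names are illustrative; on a strictly conforming compiler the cast alone should already discard the extra precision, but the volatile store forces it):

    #include <stdio.h>

    int main(void)
    {
        float a = 1.0e-8f, b = 1.0f;

        /* With FLT_EVAL_METHOD == 2 the sum may be kept in an 80-bit register,
           and some compilers have been known to ignore the (float) cast here. */
        float sum1 = (float)(a + b) - b;

        /* The volatile store forces the intermediate to be rounded to a real
           float in memory and reloaded before the subtraction. */
        volatile float t = a + b;
        float sum2 = t - b;

        printf("%g %g\n", (double)sum1, (double)sum2);
        return 0;
    }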
This covers a lot of the issues brought up (and implied) in this thread, but does not answer the question of when you should use the float_t and double_t types. Obviously, you need to do so when interfacing to an API that uses these types, especially when passing the address of one of these types.
If your prime concern is about performance, then you might want to consider using the float_t and double_t types in your computations and APIs. But it is most probable that the performance increase that you get is neither measurable nor noticeable.
However, if you are concerned about regression between different compilers and different machines, you should probably avoid these types as much as possible, and use casting liberally to assure cross-platform compatibility.
The C99 standard says:
The types float_t and double_t are floating types at least as wide as float and double, respectively, and such that double_t is at least as wide as float_t. If FLT_EVAL_METHOD equals 0, float_t and double_t are float and double, respectively; if FLT_EVAL_METHOD equals 1, they are both double; if FLT_EVAL_METHOD equals 2, they are both long double; and for other values of FLT_EVAL_METHOD, they are otherwise implementation-defined.
And indeed, in previous versions of gcc they were defined as long double by default.
The reason for float_t is that on some processors and compilers, using a larger type (e.g. long double) in place of float can be more efficient, so float_t allows the compiler to use the larger type instead of float.
Thus, in the OP's case, the change in size when using float_t is what the standard allows for. If the original code wanted the smaller float size, it should be using float.
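A hedged sketch of that trade-off; sum_squares is an illustrative helper, not anything from the OP's code. The stored data stays plain float (fixed size and layout), while the accumulator uses float_t so the compiler can work in whatever width is natural for the target, possibly wider than float:

    #include <stdio.h>
    #include <math.h>    /* float_t */
    #include <stddef.h>

    static float sum_squares(const float *x, size_t n)
    {
        float_t acc = 0.0f;             /* may be float, double or long double */
        for (size_t i = 0; i < n; i++)
            acc += (float_t)x[i] * x[i];
        return (float)acc;              /* narrow back to float at the end */
    }

    int main(void)
    {
        float v[3] = {1.0f, 2.0f, 3.0f};
        printf("%f\n", sum_squares(v, 3));
        return 0;
    }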
There is some rationale in the open-std doc; for example, the type definitions float_t and double_t (defined in <math.h>) are intended to allow effective use of architectures with more efficient, wider formats.