In my project I have to compute division, multiplication, subtraction, and addition on a matrix of double elements. The problem is that when the size of the matrix increases, the accumulated rounding error grows and the results become imprecise. Are there floating-point types with greater precision than double?
You might want to consider the order of operations, i.e. do the additions in an ordered sequence, starting with the smallest values first. This will increase the overall accuracy of the results for the same mantissa precision:
1e00 + 1e-16 + ... + 1e-16 (1e16 times) = 1e00
1e-16 + ... + 1e-16 (1e16 times) + 1e00 = 2e00
The point is that adding small numbers to a large number will make them disappear, so the latter approach reduces the numerical error.
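A minimal C++ sketch of this effect (the loop count and magnitudes are illustrative; any addend below the spacing of doubles near 1.0 behaves this way):

```cpp
#include <cstdio>

int main() {
    const int n = 1000000;      // one million terms
    const double tiny = 1e-17;  // below half the spacing of doubles near 1.0

    // Large value first: every tiny addend is rounded away.
    double a = 1.0;
    for (int i = 0; i < n; ++i) a += tiny;

    // Small values first: they accumulate before meeting the large value.
    double b = 0.0;
    for (int i = 0; i < n; ++i) b += tiny;
    b += 1.0;

    printf("large first: %.17g\n", a);  // prints 1 exactly
    printf("small first: %.17g\n", b);  // prints approximately 1.00000000001
    return 0;
}
```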
Floating-point data types with greater precision than double are going to depend on your compiler and architecture. In order to get more than double precision, you may need to rely on a math library that supports arbitrary-precision calculations. These probably won't be fast, though.
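For instance, a sketch using Boost.Multiprecision (the library choice is an assumption; the answer names no specific one):

```cpp
#include <boost/multiprecision/cpp_bin_float.hpp>
#include <iomanip>
#include <iostream>

int main() {
    // cpp_bin_float_50 carries roughly 50 significant decimal digits,
    // at the cost of software arithmetic (much slower than hardware double).
    using boost::multiprecision::cpp_bin_float_50;

    cpp_bin_float_50 big = 1;
    cpp_bin_float_50 tiny("1e-30");  // far below double's 15-17 digits

    // With double, 1 + 1e-30 == 1; here the tiny term survives.
    std::cout << std::setprecision(35) << big + tiny << "\n";
    return 0;
}
```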
According to Wikipedia, the 80-bit "Intel" IEEE 754 extended-precision long double, which is 80 bits padded to 16 bytes in memory, has a 64-bit mantissa with no implicit bit, which gets you 19.26 decimal digits. This has been the almost universal standard for long double for ages, but recently things have started to change.
The newer 128-bit quad-precision format has 112 mantissa bits plus an implicit bit, which gets you 34 decimal digits. GCC implements this as the __float128 type, and there is (if memory serves) a compiler option to set long double to it.
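A small sketch comparing the two, assuming GCC on x86 with libquadmath available (link with -lquadmath):

```cpp
// Build: g++ quad_demo.cpp -lquadmath
#include <cstdio>
#include <quadmath.h>

int main() {
    long double ld = 1.0L / 3.0L;        // 80-bit extended on x86: ~19 digits
    __float128  q  = (__float128)1 / 3;  // quad precision: ~34 digits

    printf("long double: %.21Lg\n", ld);

    // __float128 has no printf conversion; libquadmath provides its own formatter.
    char buf[64];
    quadmath_snprintf(buf, sizeof buf, "%.36Qg", q);
    printf("__float128:  %s\n", buf);
    return 0;
}
```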
On Intel architectures, long double is 80 bits wide (with a 64-bit mantissa).
What kind of values do you want to represent? Maybe you are better off using fixed-point arithmetic, where values are exact multiples of a fixed scale.
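A minimal fixed-point sketch in C++ (the Fixed type and the 1/10000 scale are illustrative, not a specific library):

```cpp
#include <cstdint>
#include <cstdio>

// Values are stored as integer multiples of 1/10000 (four decimal places).
// Addition and subtraction are exact as long as int64_t does not overflow.
struct Fixed {
    static constexpr int64_t SCALE = 10000;
    int64_t raw;  // the represented value times SCALE

    static Fixed from_double(double d) {
        return Fixed{ (int64_t)(d * SCALE + (d < 0 ? -0.5 : 0.5)) };  // round to nearest
    }
    double to_double() const { return (double)raw / SCALE; }

    Fixed operator+(Fixed o) const { return Fixed{ raw + o.raw }; }
    Fixed operator-(Fixed o) const { return Fixed{ raw - o.raw }; }
    Fixed operator*(Fixed o) const { return Fixed{ raw * o.raw / SCALE }; }
    Fixed operator/(Fixed o) const { return Fixed{ raw * SCALE / o.raw }; }
};

int main() {
    Fixed a = Fixed::from_double(0.1);
    Fixed b = Fixed::from_double(0.2);
    printf("%.4f\n", (a + b).to_double());  // exactly 0.3000, unlike raw double
    return 0;
}
```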