Arithmetic precision with doubles in Matlab

礼貌的吻别 2021-01-19 02:18

I am having a bit of trouble understanding how the precision of these doubles affects the outcome of arithmetic operations in Matlab. I thought that since both a & b are

3 Answers
  • 2021-01-19 02:27

    "Floating" point means just that--the precision is relative to the scale of the number itself.

    In the specific example you gave, 1.22e-45 can be represented alone because the exponent can be adjusted to represent 10^-45, or approximately 2^-150.

    On the other hand, 1.0 is represented in binary with scale 2^0 (i.e., 1).

    To add these two values, you need to align their decimal points (er...binary points), meaning that all of the precision of 1.22e-45 is shifted 150-odd bits to the right.

    Of course, IEEE double precision floating point values only have 53 bits of mantissa (precision), meaning that at the scale of 1.0, 1.22e-45 is effectively zero.
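
    The alignment argument can be checked directly. This is only a minimal sketch: a and b follow the question, the other variable names are made up for illustration, and the comments describe what IEEE double arithmetic should give rather than captured output.

    ```matlab
    a = 1.22e-45;
    b = 1;

    % Binary exponent of a: log2(a) is roughly -149, so a sits about
    % 150 bits below the scale of 1.0.
    exponent_of_a = floor(log2(a));

    % A double carries 53 significant bits, so a lies far below the
    % last bit that b can hold.
    bits_below_one = -exponent_of_a;
    lost = bits_below_one > 53;      % true: a falls off the end of the mantissa

    % The sum therefore rounds back to b exactly.
    same = (a + b) == b;             % true
    ```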

  • 2021-01-19 02:29

    64-bit IEEE-754 floating point numbers have enough precision (with a 53 bit mantissa) to represent about 16 significant decimal digits. But it requires more like 45 significant decimal digits to tell the difference between (1+a) = 1.00....000122 and 1.000 for your example.
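
    To see roughly where the 16 digits run out, here is a small sketch (the values 1e-15 and 1e-17 are chosen only for illustration, and the variable names are hypothetical):

    ```matlab
    % Perturbations on either side of the 16th significant digit of 1.
    x_big   = 1 + 1e-15;   % 1e-15 is several times eps(1), so it survives
    x_small = 1 + 1e-17;   % 1e-17 is below eps(1)/2, so it is rounded away

    survives = x_big > 1;      % true
    absorbed = x_small == 1;   % true

    % Print with more digits than a double actually holds, to see where
    % the meaningful digits end.
    fprintf('%.20g\n%.20g\n', x_big, x_small);
    ```

    A perturbation of 1.22e-45 is about 29 orders of magnitude smaller again than the one that already vanishes here.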

  • 2021-01-19 02:43

    To add to what the other answers have said, you can use the MATLAB function eps to visualize the precision issue you are running into. For a given double-precision floating-point number, eps returns the distance from it to the next larger representable floating-point number:

    >> a = 1.22e-45;
    >> b = 1;
    >> eps(b)
    
    ans =
    
      2.2204e-016
    

    Note that the next floating point number that is larger than 1 is 1.00000000000000022204..., and the value of a doesn't even come close to half the distance between the two numbers. Hence a+b ends up staying 1.
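
    The "half the distance" threshold can be probed with a short sketch like the one below. It assumes the default IEEE round-to-nearest, ties-to-even mode; the variable names are just for illustration.

    ```matlab
    b = 1;
    half_gap = eps(b) / 2;    % exactly 2^-53

    % Exactly halfway between 1 and the next double: the tie goes to the
    % neighbour with an even last bit, which is 1 itself.
    tie_case = (b + half_gap) == b;                    % true

    % More than halfway: the sum rounds up to the next double.
    above_half = (b + 0.75 * eps(b)) == (b + eps(b));  % true

    % The value from the question is nowhere near the halfway point.
    far_below = (b + 1.22e-45) == b;                   % true
    ```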

    Incidentally, you can also see why a is considered non-zero even though it is so small by looking at the smallest positive normalized double-precision floating-point value, which the function realmin returns:

    >> realmin
    
    ans =
    
      2.2251e-308  %# MUCH smaller than a!
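
    A short sketch along the same lines, with variable names made up for illustration. realmin is the smallest positive normalized double; subnormal values go lower still, and eps(0) gives the smallest of them.

    ```matlab
    a = 1.22e-45;

    is_nonzero    = a ~= 0;        % true
    above_realmin = a > realmin;   % true: a is an ordinary normalized double

    % Spacing of doubles near a, for comparison with eps(1).
    gap_at_a     = eps(a);
    relative_gap = gap_at_a / a;   % on the order of 1e-16, the same relative
                                   % precision as near 1

    % Smallest positive subnormal double, well below realmin.
    tiniest = eps(0);              % 2^-1074, about 4.9407e-324
    ```

    So a keeps its full relative precision on its own; it only disappears when added to something as large as 1.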
    