ieee-754

IEEE Std 754 Floating-Point: let t := a - b, does the standard guarantee that a == b + t?

十年热恋 submitted on 2019-12-05 01:37:42
Assume that t, a, b are all double (IEEE Std 754) variables, and that neither a nor b is NaN (though either may be Inf). After t = a - b, do I necessarily have a == b + t? Absolutely not. One obvious case is a=DBL_MAX, b=-DBL_MAX. Then t=INFINITY, so b+t is also INFINITY. What may be more surprising is that there are cases where this happens without any overflow. Basically, they're all of the form where a-b is inexact. For example, if a is DBL_EPSILON/4 and b is -1, a-b is 1 (assuming default rounding mode), and a-b+b is then 0. The reason I mention this second example is that this is the
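
A minimal sketch of the two counterexamples in the answer above, written in Python on the assumption that its float is an IEEE 754 double (true on common platforms). Here sys.float_info.max and sys.float_info.epsilon stand in for DBL_MAX and DBL_EPSILON, and the helper name round_trips is made up for illustration:

```python
import sys

DBL_MAX = sys.float_info.max          # stand-in for C's DBL_MAX
DBL_EPSILON = sys.float_info.epsilon  # stand-in for C's DBL_EPSILON


def round_trips(a, b):
    """Return True if a == b + (a - b) holds in double arithmetic."""
    t = a - b
    return a == b + t


# Overflow case: a - b overflows to infinity, so b + t is infinity, not a.
overflow_case = round_trips(DBL_MAX, -DBL_MAX)

# Inexact case: 1 + DBL_EPSILON/4 rounds to 1, so b + t is 0, not a.
inexact_case = round_trips(DBL_EPSILON / 4, -1.0)

# A case where the subtraction is exact and the identity holds.
exact_case = round_trips(3.0, 1.0)

print(overflow_case, inexact_case, exact_case)
```

The identity survives only when a - b is computed exactly, which is the point the answer makes.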

Number of floats between two floats

扶醉桌前 submitted on 2019-12-05 01:31:23
Say I have two Python floats a and b. Is there an easy way to find out how many representable real numbers lie between the two in IEEE-754 representation (or whatever representation the machine in use has)? I don't know what you will be using this for, but if both floats have the same exponent, it should be possible. As the exponent is kept in the high-order bits, loading the float bytes (8 bytes in this case) as an integer and subtracting one from the other should give the number you want. I use the struct module to pack the floats to a binary representation, and then unpack those as (C, 8
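
The pack-and-reinterpret idea described above can be sketched as follows. This is one possible reading, not the answerer's actual code (the snippet is cut off before it appears): the function names ordered_bits and ulp_distance are hypothetical, and the sign handling is an addition so that the count also works across zero and across different exponents. It assumes finite inputs and 64-bit IEEE 754 doubles.

```python
import struct
import sys


def ordered_bits(x):
    """Map a double to an integer that increases with the float's value."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (u,) = struct.unpack("<Q", struct.pack("<d", x))
    magnitude = u & ((1 << 63) - 1)
    # Sign-magnitude to signed: negative floats get negative integers,
    # and both +0.0 and -0.0 map to 0.
    return -magnitude if u >> 63 else magnitude


def ulp_distance(a, b):
    """Number of representable steps from a to b (finite inputs assumed)."""
    return abs(ordered_bits(a) - ordered_bits(b))


print(ulp_distance(1.0, 2.0))
print(ulp_distance(1.0, 1.0 + sys.float_info.epsilon))
```

The number of doubles strictly between a and b is then ulp_distance(a, b) - 1 whenever a != b.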

`std::sin` is wrong in the last bit

我与影子孤独终老i submitted on 2019-12-05 01:26:27
I am porting a program from Matlab to C++ for efficiency. It is important for the output of both programs to be exactly the same (**). I am getting different results for this operation: std::sin(0.497418836818383950) = 0.477158760259608410 (C++); sin(0.497418836818383950) = 0.47715876025960846000 (Matlab); N[Sin[0.497418836818383950], 20] = 0.477158760259608433 (Mathematica). So, as far as I know, both C++ and Matlab are using IEEE754-defined double arithmetic. I think I have read somewhere that IEEE754 allows different results in the last bit. Using Mathematica to decide, it seems like C++ is

Layman's explanation for why JavaScript has weird floating math – IEEE 754 standard [duplicate]

旧时模样 submitted on 2019-12-05 01:16:57
Question: This question already has answers here: Is floating point math broken? (31 answers). Closed 5 years ago. I never understand exactly what's going on with JavaScript when I do mathematical operations on floating point numbers. I've been downright fearful of using decimals, to the point where I just avoid them whenever possible. However, if I knew what was going on behind the scenes when it comes to the IEEE 754 standard, then I would be able to predict what would happen; with
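
As a small illustration of what happens behind the scenes, here is a sketch in Python rather than JavaScript; both use IEEE 754 doubles for ordinary numbers, so the arithmetic is the same. The choice of 0.1 and 0.2 is an assumption for the example and is not taken from the question:

```python
from fractions import Fraction

# Decimal literals such as 0.1 have no finite binary expansion, so the
# stored double is only the nearest representable value.
exact_tenth = Fraction(0.1)  # the exact rational value of the stored double
print(exact_tenth)
print(exact_tenth > Fraction(1, 10))  # the stored value is slightly too large

# The small representation errors show up once values are combined.
total = 0.1 + 0.2
print(repr(total))
print(total == 0.3)
```

The arithmetic itself is correctly rounded; the surprise comes from the inputs already being approximations of the decimal values that were typed.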

Why is IEEE-754 Floating Point not exchangeable between platforms?

空扰寡人 submitted on 2019-12-04 22:21:01
Question: It has been asserted that (even accounting for byte endianness) IEEE754 floating point is not guaranteed to be exchangeable between platforms. So: Why, theoretically, is IEEE floating point not exchangeable between platforms? Are any of these concerns valid for modern hardware platforms (e.g. i686, x64, arm)? If the concerns are valid, can you please demonstrate an example where this is the case (C or C++ is preferred)? Motivation: Several GPS manufacturers exchange their binary formats for

IEEE float hex 424ce027 to float?

两盒软妹~` submitted on 2019-12-04 18:45:30
If I have an IEEE float hex 424ce027, how do I convert it to decimal? Given unsigned char ptr[] = {0x42,0x4c,0xe0,0x27}; how do I get float tmp = 51.218899;? Perhaps... float f = *reinterpret_cast<float*>(ptr); Although on my x86 machine here I also had to reverse the byte order of the character array to get the value you wanted: std::reverse(ptr, ptr + 4); float f = *reinterpret_cast<float*>(ptr); You might want to use sizeof(float) instead of 4, or some other way to get the size. You might want to reverse a copy of the bytes, not the original. It's somewhat ugly however you do it. edit: As pointed out in the
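
For comparison, a minimal Python sketch of the same conversion, where the byte order is stated explicitly instead of depending on the host machine. This is an illustration using the standard struct module, not the approach from the answer above; the variable names are made up:

```python
import struct

# The four bytes from the question, most significant byte first.
raw = bytes([0x42, 0x4C, 0xE0, 0x27])

# ">f" reads the bytes as a big-endian IEEE 754 single.
value = struct.unpack(">f", raw)[0]

# Equivalent to the answer's byte reversal on a little-endian machine:
# reverse a copy of the bytes, then read them as little-endian ("<f").
swapped = struct.unpack("<f", raw[::-1])[0]

print(value, swapped)
```

Both reads give the same number, approximately the 51.218899 from the question.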

Are there any IEEE 754 standard implementations for Java floating point primitives?

左心房为你撑大大i submitted on 2019-12-04 18:33:51
Question: I'm interested in whether Java uses the IEEE 754 standard to implement its floating point arithmetic. I saw this kind of thing in the documentation: operation defined in IEEE 754-2008. As I understand it, the benefit of IEEE 754 is increased precision of floating point arithmetic, so if I use double or float in Java, would the precision of computations be the same as with BigDecimal? And if not, then what's the point of using the IEEE 754 standard in the Math class? Answer 1: I'm interested if Java is using IEEE 754

IEEE - 754 - find signbit, exponent, frac, normalized, etc

99封情书 submitted on 2019-12-04 17:18:24
I am taking in an 8-digit hexadecimal number as the bits of an IEEE 754 floating point number, and I want to print information about that number (signbit, expbits, fractbits, normalized, denormalized, infinity, zero, NaN). The floating point number should be single precision. I read up on bit shifting, and I think this is how I am supposed to do it? However, I am not 100% sure. I understand that the sign bit is found in the leftmost position of the number, which indicates positive or negative. How much do I shift to find each field? Do I just keep shifting to find each one? Can someone explain how I am to find each one?
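
A minimal sketch of the shifting and masking being asked about, assuming the standard single-precision layout of 1 sign bit, 8 exponent bits and 23 fraction bits. It is written in Python for brevity, and the function name decode_single is hypothetical:

```python
def decode_single(hex_digits):
    """Split an 8-digit hex string into IEEE 754 single-precision fields."""
    bits = int(hex_digits, 16)

    sign = bits >> 31               # bit 31: shift right by 31
    exponent = (bits >> 23) & 0xFF  # bits 30..23: shift by 23, keep 8 bits
    fraction = bits & 0x7FFFFF      # bits 22..0: no shift, keep 23 bits

    if exponent == 0xFF:
        # All exponent bits set: infinity or NaN, decided by the fraction.
        kind = "infinity" if fraction == 0 else "NaN"
    elif exponent == 0:
        # All exponent bits clear: zero or a denormalized number.
        kind = "zero" if fraction == 0 else "denormalized"
    else:
        kind = "normalized"
    return sign, exponent, fraction, kind


print(decode_single("424ce027"))
```

Each field needs one shift and one mask applied to the original value, so there is no need to keep shifting the same variable repeatedly.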

Do-s and Don't-s for floating point arithmetic?

↘锁芯ラ submitted on 2019-12-04 17:14:54
Question: What are some good do-s and don't-s for floating point arithmetic (IEEE754 in case there's confusion) to ensure good numerical stability and high accuracy in your results? I know a few, like don't subtract quantities of similar magnitude, but I'm curious what other good rules are out there. Answer 1: First, enter with the notion that floating point numbers do NOT necessarily follow the same rules as real numbers... once you have accepted this, you will understand most of the pitfalls. Here's some
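
To make the "not the same rules as real numbers" point concrete, here is a small Python sketch showing that the order of operations changes the result when quantities of similar magnitude are subtracted. The value 1e-16 is chosen for illustration only, and math.fsum is shown as one standard-library way to sum with less error, not as a rule from the answer:

```python
import math

x = 1e-16  # smaller than half the spacing between 1.0 and the next double

# Adding x to 1.0 first rounds it away, so the subtraction returns 0.0.
naive = (1.0 + x) - 1.0

# Cancelling the large terms first keeps x intact.
reordered = (1.0 - 1.0) + x

# math.fsum tracks the lost low-order bits and rounds only once at the end.
compensated = math.fsum([1.0, x, -1.0])

print(naive, reordered, compensated)
```

All three expressions are equal in real arithmetic, but only the last two recover x in double arithmetic.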

Are there any whole numbers which the double cannot represent within the MIN/MAX range of a double?

喜夏-厌秋 submitted on 2019-12-04 16:43:24
Question: I realize that whenever one is dealing with IEEE 754 doubles and floats, some numbers can't be represented, especially when one tries to represent numbers with lots of digits after the decimal point. This is well understood, but I was curious whether there are any whole numbers within the MIN/MAX range of a double (or float) that can't be represented and thus need to be rounded to the nearest representable IEEE 754 value? For instance very large numbers are sometimes represented in
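
A short Python sketch that answers the question empirically, assuming the usual 53-bit significand for double and 24-bit significand for float. The helper as_float32 is hypothetical and uses struct to round a value to single precision:

```python
import struct

# Doubles carry 53 significand bits, so 2**53 + 1 is the first positive
# whole number that has no exact double representation.
LIMIT_DOUBLE = 2**53
rounded = float(LIMIT_DOUBLE + 1)
print(int(rounded) == LIMIT_DOUBLE)  # it was rounded to the neighbour 2**53


def as_float32(n):
    """Round n to the nearest IEEE 754 single and return it as a Python float."""
    return struct.unpack("f", struct.pack("f", n))[0]


# Singles carry 24 significand bits, so the same happens at 2**24 + 1.
LIMIT_SINGLE = 2**24
print(as_float32(LIMIT_SINGLE + 1) == LIMIT_SINGLE)
```

Above these limits the spacing between adjacent representable values is larger than 1, so only some whole numbers can be stored exactly.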