ieee-754

max float represented in IEEE 754

旧巷老猫 submitted on 2019-12-30 08:59:32
Question: I am wondering if the max float represented in IEEE 754 is (1.11111111111111111111111)_b*2^[(11111111)_b-127]. Here _b means binary representation. But that value is 3.403201383*10^38, which is different from 3.402823669*10^38, which is (1.0)_b*2^[(11111111)_b-127] and is given by, for example, C++ <limits>. Isn't (1.11111111111111111111111)_b*2^[(11111111)_b-127] representable and larger in the framework? Does anybody know why? Thank you. Answer 1: The exponent (11111111)_b is reserved for infinities
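The reserved all-ones exponent mentioned in the answer can be checked by decoding bit patterns directly. Below is a minimal Python sketch using the standard `struct` module; the helper name `bits_to_float32` is made up for illustration and is not part of the original post.

```python
import math
import struct

def bits_to_float32(bits: int) -> float:
    """Reinterpret a 32-bit unsigned integer as an IEEE 754 binary32 value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Largest finite value: biased exponent 11111110 (254), fraction all ones.
largest_finite = bits_to_float32(0x7F7FFFFF)

# Biased exponent 11111111 (255) is reserved: a zero fraction encodes
# infinity, and any non-zero fraction encodes a NaN.
infinity = bits_to_float32(0x7F800000)
not_a_number = bits_to_float32(0x7FFFFFFF)

print(largest_finite)                              # about 3.4028235e+38
print(largest_finite == (2 - 2**-23) * 2.0**127)   # same value, closed form
print(math.isinf(infinity), math.isnan(not_a_number))
```

So the largest finite single-precision value uses exponent field 254, not 255, which is why it comes out just below 2^128.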

Minimum and maximum of signed zero

蹲街弑〆低调 submitted on 2019-12-30 06:16:18
Question: I am concerned about the following cases: min(-0.0,0.0), max(-0.0,0.0), minmag(-x,x), maxmag(-x,x). According to Wikipedia, IEEE 754-2008 says in regard to min and max: "The min and max operations are defined but leave some leeway for the case where the inputs are equal in value but differ in representation. In particular: min(+0,−0) or min(−0,+0) must produce something with a value of zero but may always return the first argument." I did some tests comparing fmin, fmax, min and max as defined below
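The "may always return the first argument" leeway is easy to observe in a language whose min and max are defined purely by comparison. The Python sketch below is only an analogy: it exercises Python's built-in `min` and `max`, not the C `fmin`/`fmax` functions the question tests, and the helper `sign_of` is a hypothetical name.

```python
import math

def sign_of(x: float) -> str:
    """Return '-' or '+' according to the sign bit, which also works for zeros."""
    return "-" if math.copysign(1.0, x) < 0 else "+"

# -0.0 == 0.0 compares equal, so a comparison-based min/max cannot tell
# them apart and simply keeps the first argument it saw.
for name, func in (("min", min), ("max", max)):
    for args in ((-0.0, 0.0), (0.0, -0.0)):
        print(name, args, "->", sign_of(func(*args)) + "0")
```

Python documents that when several items are minimal (or maximal) the first one encountered is returned, so the sign of the result depends only on argument order here.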

Lua - packing IEEE754 single-precision floating-point numbers

故事扮演 submitted on 2019-12-30 03:13:05
Question: I want to make a function in pure Lua that generates a fraction (23 bits), an exponent (8 bits), and a sign (1 bit) from a number, so that the number is approximately equal to math.ldexp(fraction, exponent - 127) * (sign == 1 and -1 or 1), and then packs the generated values into 32 bits. A certain function in the math library caught my attention: The frexp function breaks down the floating-point value (v) into a mantissa (m) and an exponent (n), such that the absolute value of m is greater
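The frexp-based approach can be sketched as follows. This is a Python illustration rather than the Lua the question asks for, and it only covers zeros and normal numbers (subnormals, infinities and NaN are left out on purpose). The names `pack_float32` and `reference_bits` are made up for this example; `math.frexp` returns a mantissa m with 0.5 <= |m| < 1, so it is rescaled to the 1.f form first.

```python
import math
import struct

def pack_float32(x: float) -> int:
    """Pack x into IEEE 754 binary32 bits (zeros and normal numbers only)."""
    sign = 1 if math.copysign(1.0, x) < 0 else 0
    if x == 0:
        return sign << 31
    m, e = math.frexp(abs(x))            # abs(x) == m * 2**e, 0.5 <= m < 1
    exponent = (e - 1) + 127             # rescale to 1.f * 2**(e - 1), add bias
    fraction = round((2 * m - 1) * 2**23)  # Python rounds ties to even
    if fraction == 2**23:                # rounding carried into the exponent
        fraction = 0
        exponent += 1
    if not 1 <= exponent <= 254:
        raise ValueError("outside the normal binary32 range")
    return (sign << 31) | (exponent << 23) | fraction

def reference_bits(x: float) -> int:
    """Bits produced by the platform's own double-to-float conversion."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

for value in (1.0, -2.5, 0.1, 3.14159):
    print(value, hex(pack_float32(value)), hex(reference_bits(value)))
```

The same steps map onto Lua's math.frexp, with the caveat that a Lua version needs its own round-to-nearest-even helper.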

Implementing single-precision division as double-precision multiplication

坚强是说给别人听的谎言 submitted on 2019-12-29 08:04:26
Question: For a C99 compiler implementing exact IEEE 754 arithmetic, do values of f, divisor of type float exist such that f / divisor != (float)(f * (1.0 / divisor))? EDIT: By "implementing exact IEEE 754 arithmetic" I mean a compiler that rightfully defines FLT_EVAL_METHOD as 0. Context: A C compiler that provides IEEE 754-compliant floating-point can only replace a single-precision division by a constant by a single-precision multiplication by the inverse if said inverse is itself
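One way to explore the question empirically is to emulate both expressions and compare them for chosen inputs. The Python sketch below makes two assumptions that are not in the original post: float rounding is emulated by a round trip through `struct`, and the direct single-precision quotient is obtained by rounding the double quotient, which relies on the known result that rounding through double is harmless for division of two floats. The names `to_float32`, `division_matches` and `find_mismatch` are hypothetical, and the sketch makes no claim about whether a mismatch exists.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def division_matches(f: float, divisor: float) -> bool:
    """Compare f / divisor with (float)(f * (1.0 / divisor)) for float inputs."""
    f, divisor = to_float32(f), to_float32(divisor)
    direct = to_float32(f / divisor)                  # emulated float division
    via_reciprocal = to_float32(f * (1.0 / divisor))  # double multiply, then cast
    return direct == via_reciprocal

def find_mismatch(pairs):
    """Return the first (f, divisor) pair whose two results differ, else None."""
    for f, divisor in pairs:
        if not division_matches(f, divisor):
            return (f, divisor)
    return None

print(find_mismatch([(1.0, 3.0), (10.0, 4.0), (7.0, 8.0)]))
```

Divisors that are powers of two always match, because their reciprocal is exact; the interesting candidates are divisors whose reciprocal has to be rounded.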

What is long double on x86-64?

别说谁变了你拦得住时间么 submitted on 2019-12-29 06:21:41
Question: Someone told me that: "Under x86-64, FP arithmetic is done with SSE, and therefore long double is 64 bits." But the x86-64 ABI says:
C type | sizeof | alignment | AMD64 Architecture
long double | 16 | 16 | 80-bit extended (IEEE-754)
See: amd64-abi.pdf. And gcc says sizeof(long double) is 16 and gives FLT_DBL = 1.79769e+308 and FLT_LDBL = 1.18973e+4932. So I'm confused: how is long double 64 bits? I thought it was an 80-bit representation. Answer 1: Under x86-64, FP arithmetic is done with SSE

Go float comparison [duplicate]

无人久伴 submitted on 2019-12-28 12:44:48
Question: This question already has answers here: Is floating point math broken? (31 answers) Closed 2 years ago. In order to compare two floats (float64) for equality in Go, my superficial understanding of IEEE 754 and binary representation of floats makes me think that this is a good solution:
func Equal(a, b float64) bool {
    ba := math.Float64bits(a)
    bb := math.Float64bits(b)
    diff := ba - bb
    if diff < 0 {
        diff = -diff
    }
    // accept one bit difference
    return diff < 2
}
The question is: Is this a more
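One detail worth noting about the Go snippet: math.Float64bits returns a uint64, so ba - bb wraps around instead of going negative and the diff < 0 branch never runs. A comparison by distance in units in the last place (ULPs) can be sketched as below. This is a Python illustration, not the asker's code; `ordered_bits`, `ulp_distance` and `almost_equal` are hypothetical names, and the default tolerance of one ULP mirrors the "one bit difference" in the question.

```python
import math
import struct

def ordered_bits(x: float) -> int:
    """Map a float64 to an integer whose ordering matches the float ordering."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    # Sign bit set: mirror negative values below zero, so -0.0 maps to 0.
    return bits if bits < 2**63 else 2**63 - bits

def ulp_distance(a: float, b: float) -> int:
    """Number of representable float64 steps between a and b."""
    return abs(ordered_bits(a) - ordered_bits(b))

def almost_equal(a: float, b: float, max_ulps: int = 1) -> bool:
    """True if a and b are at most max_ulps representable values apart."""
    if math.isnan(a) or math.isnan(b):
        return False  # NaN never compares equal to anything
    return ulp_distance(a, b) <= max_ulps

print(almost_equal(0.1 + 0.2, 0.3))   # neighbours in float64
print(almost_equal(0.0, -0.0))        # both zeros map to the same integer
print(almost_equal(1.0, 1.0000001))   # many ULPs apart
```

Python integers do not wrap, which sidesteps the unsigned subtraction issue. A fixed ULP budget is still a design choice: it treats the largest finite value and infinity as neighbours, and it does not suit results that have accumulated many rounding errors.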

Matlab vs C++ Double Precision

纵饮孤独 submitted on 2019-12-28 06:53:28
Question: I am porting some code from Matlab to C++. In Matlab:
format long
D = 0.689655172413793 (this is 1.0 / 1.45)
E = 2600 / D // I get E = 3.770000000000e+03
In C++:
double D = 0.68965517241379315; //(this is 1.0 / 1.45)
double E = 2600 / D; //I get E = 3769.9999999999995
It is a problem for me because in both cases I have to round toward 0 (Matlab's fix), and in the first case (Matlab) it becomes 3770, whereas in the second case (C++) it becomes 3769. I realise that it is because of the two

IEEE 754 floating point arithmetic rounding error in c# and javascript

断了今生、忘了曾经 submitted on 2019-12-28 06:51:12
Question: I just read a book about JavaScript. The author mentioned a floating-point arithmetic rounding error in the IEEE 754 standard. For example, adding 0.1 and 0.2 yields 0.30000000000000004 instead of 0.3, so (0.1 + 0.2) == 0.3 returns false. I also reproduced this error in C#. So my questions are: How often does this error occur? What is the best-practice workaround in C# and JavaScript? Which other languages have the same error? Answer 1: It's not an error in the language. It's not an error in
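The same behaviour shows up in any language that uses IEEE 754 binary64 for its default floating-point type. The Python sketch below reproduces it and shows two common workarounds, a tolerance-based comparison and decimal arithmetic; it is an illustration in Python, not the C# or JavaScript the question asks about.

```python
import math
from decimal import Decimal

total = 0.1 + 0.2

print(total == 0.3)               # False: the two doubles differ in the last bit
print(math.isclose(total, 0.3))   # True: comparison with a relative tolerance

# Decimal arithmetic represents 0.1, 0.2 and 0.3 exactly.
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True

# The exact binary value that is actually stored for the literal 0.1.
print(Decimal(0.1))
```

Neither operand is exactly representable in binary, so the rounding happens before the addition even starts; the sum merely exposes it.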

Java Double increment

时间秒杀一切 submitted on 2019-12-25 06:57:14
Question: I have a double variable:
public double votes(){
    double votexp = 0;
    for(Elettore e:docenti.values()){
        if(e.getVoto()==true) //every time this is true increment by 1
        { votexp+=1.0; }
    }
    for(Elettore e:studenti.values()){
        if(e.getVoto()==true) //every time this is true increment by 0.2
        { votexp+=0.2; }
    }
    for(Elettore e:pta.values()){
        if(e.getVoto()==true) //every time this is true increment by 0.2
        { votexp+=0.2; }
    }
    return votexp;
}
In my case the variable should be incremented to 2.6 but votexp returns 2
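Since 0.2 has no exact binary representation, repeatedly adding it to a double accumulates rounding error. One common way around this is to count the votes as integers and apply the weight once, in decimal arithmetic. The Python sketch below illustrates the idea; the function names and the vote counts in the usage line are hypothetical, because the question does not say how many voters of each kind there are.

```python
from decimal import Decimal

def float_votes(docenti: int, studenti: int, pta: int) -> float:
    """Mirror the loop in the question: add 1.0 or 0.2 per vote in binary floats."""
    total = 0.0
    for _ in range(docenti):
        total += 1.0
    for _ in range(studenti + pta):
        total += 0.2   # each addition may round
    return total

def exact_votes(docenti: int, studenti: int, pta: int) -> Decimal:
    """Count votes as integers and apply the 0.2 weight once, in decimal."""
    return Decimal(docenti) + Decimal("0.2") * (studenti + pta)

# Hypothetical counts: 2 docenti, 2 studenti and 1 pta voter.
print(repr(float_votes(2, 2, 1)))
print(exact_votes(2, 2, 1))
```

In Java the corresponding tool would be BigDecimal, or simply an integer count of fifths that is divided by 5 only when the result is displayed.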

How to weigh up calculation error

試著忘記壹切 submitted on 2019-12-25 03:12:31
Question: Consider the following example. There is an image in which the user can select a rectangular area (part of it). The image is displayed at some scale. Then we change the scale and need to recalculate the new coordinates of the selection. Let's take the width: newSelectionWidth = round(oldSelectionWidth / oldScale * newScale), where oldScale = oldDisplayImageWidth / realImageWidth and newScale = newDisplayImageWidth / realImageWidth; all the values except for the scales are integers. The question is how to prove
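Because realImageWidth cancels algebraically, the expression reduces to oldSelectionWidth * newDisplayImageWidth / oldDisplayImageWidth, which can be evaluated exactly with rational arithmetic and compared against the floating-point version. The Python sketch below does that; the function names are made up, and it assumes Python's `round`, which rounds ties to even and may differ from the `round` intended in the question.

```python
from fractions import Fraction

def new_width_float(old_sel: int, old_disp: int, new_disp: int, real: int) -> int:
    """Floating-point version, following the formula in the question."""
    old_scale = old_disp / real
    new_scale = new_disp / real
    return round(old_sel / old_scale * new_scale)

def new_width_exact(old_sel: int, old_disp: int, new_disp: int) -> int:
    """Exact version: realImageWidth cancels, leaving a single rational number."""
    return round(Fraction(old_sel * new_disp, old_disp))

# Hypothetical sizes: a 100 px selection, display width going from 800 to 400.
print(new_width_float(100, 800, 400, 1600), new_width_exact(100, 800, 400))
```

Running both functions over the range of sizes an application actually uses is a practical way to see whether the floating-point result ever differs from the exact one.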