ieee-754

IEEE-754 32-bit (single precision) exponent -126 instead of -127

Posted by 橙三吉。 on 2019-12-10 22:38:27
Question: I know that if I have a number like this:

```
1 | 1001 0001 | 0011 0011 0000 0001 0101 000
sign bit | 8-bit biased exponent | 23-bit fraction/mantissa
```

I can calculate the "real" exponent by subtracting the bias 127 (0111 1111) from the biased exponent, i.e.

```
1001 0001 - 0111 1111 = 10010   (so the real exponent is 18)
1.0011 0011 0000 0001 0101 000 * 2^18
```

So now my question: if I have a (denormalized) number like this:

```
0 | 0000 0000 | 0000 0000 0000 0000 0000 001
```

why is the exponent -126 and not -127?
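The short answer: a biased exponent of all zeros is a special case. The exponent is fixed at 1 - 127 = -126 and the implicit leading 1 is dropped, so the largest subnormal (0.111...1 × 2^-126) meets the smallest normal (1.0 × 2^-126) without a gap. A minimal Python sketch of both decoding rules (the function name decode_bits is mine, for illustration; infinities/NaN are not handled):

```python
import struct

def decode_bits(bits32):
    """Decode a 32-bit IEEE-754 pattern by hand and cross-check with struct."""
    sign = bits32 >> 31
    biased_exp = (bits32 >> 23) & 0xFF
    frac = bits32 & 0x7FFFFF
    if biased_exp == 0:
        # Subnormal: exponent fixed at 1 - 127 = -126, no implicit leading 1.
        value = (-1) ** sign * (frac / 2.0 ** 23) * 2.0 ** -126
    else:
        # Normal: subtract the bias 127 and add the implicit leading 1.
        value = (-1) ** sign * (1 + frac / 2.0 ** 23) * 2.0 ** (biased_exp - 127)
    # Cross-check against the platform's native float32 decoding.
    native, = struct.unpack('>f', struct.pack('>I', bits32))
    assert value == native
    return value

# Smallest positive subnormal: 0 | 00000000 | 000...001
print(decode_bits(0x00000001))  # 1.401298464324817e-45 == 2**-149
```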

Positive/Negative Infinity Constants in Fortran

Posted by 孤街浪徒 on 2019-12-10 21:04:34
Question: How could I get constants (or parameters, I suppose) that are negative and positive infinity in Fortran 2008? I tried the following code:

```fortran
program inf
    use, intrinsic :: ieee_arithmetic
    real(8), parameter :: inf_pos = ieee_value(0d0, ieee_positive_inf)
    real(8), parameter :: inf_neg = ieee_value(0d0, ieee_negative_inf)
end program inf
```

However, I get the following errors:

```
$ gfortran inf.f08
inf.f08:4:22:

     real(8) :: inf_pos = ieee_value(0d0, ieee_positive_inf)
                      1
Error: Function 'ieee_value' in
```
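For reference, the ±infinity values that ieee_value produces are simply the IEEE-754 patterns with an all-ones exponent and a zero fraction. A quick Python sketch showing the binary64 encodings (Python here only because it makes the bytes easy to inspect):

```python
import math
import struct

# IEEE-754 binary64: sign(1) | exponent(11) | fraction(52).
# Infinity = exponent all ones, fraction zero.
bits_pos, = struct.unpack('>Q', struct.pack('>d', math.inf))
bits_neg, = struct.unpack('>Q', struct.pack('>d', -math.inf))
print(hex(bits_pos))  # 0x7ff0000000000000
print(hex(bits_neg))  # 0xfff0000000000000
```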

How does MySQL do the math for floating-point addition?

Posted by 给你一囗甜甜゛ on 2019-12-10 20:59:38
Question: I tested SELECT 0.1 + 0.2; in MySQL (MariaDB), and it returned the right answer:

```
MariaDB [(none)]> SELECT 0.1 + 0.2;
+-----------+
| 0.1 + 0.2 |
+-----------+
|       0.3 |
+-----------+
1 row in set (0.000 sec)
```

Floating-point calculation is inaccurate in most programming languages because of IEEE 754, as explained here. How does MySQL do the floating-point calculation so that it returns the right answer?

Answer 1: I know SQL-92 is an old standard, but I'm pretty sure this is not changed in
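The likely explanation, consistent with the SQL standard the answer cites: literals like 0.1 and 0.2 that contain a decimal point are exact-numeric (DECIMAL) values in SQL, so the addition is carried out in decimal arithmetic, not in IEEE 754 binary. Python's decimal module shows the same contrast:

```python
from decimal import Decimal

# Binary floating point (IEEE 754 double): inexact.
print(0.1 + 0.2)                        # 0.30000000000000004

# Decimal arithmetic, as SQL exact-numeric literals behave: exact.
print(Decimal('0.1') + Decimal('0.2'))  # 0.3
```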

Decimal to binary Half-Precision IEEE 754 in Python

Posted by 烈酒焚心 on 2019-12-10 19:17:21
Question: I was only able to convert a decimal into binary single-precision IEEE 754 using struct.pack, or do the opposite (float16 or float32) using numpy.frombuffer. Is it possible to convert a decimal to binary half-precision floating point using NumPy? I need to print the result of the conversion, so if I type "117.0", it should print "0101011101010000".

Answer 1:

> if I type "117.0", it should print "0101011101010000"

```python
>>> import numpy as np
>>> bin(np.float16(117.0).view('H'))[2:].zfill(16)
'0101011101010000'
```
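For reference, the standard library can do the same without NumPy: since Python 3.6, struct supports the 'e' format for IEEE 754 half precision. A minimal sketch (the helper name float_to_bin16 is mine):

```python
import struct

def float_to_bin16(x):
    """Pack x as IEEE 754 binary16 (big-endian) and format the 16 bits."""
    (bits,) = struct.unpack('>H', struct.pack('>e', x))
    return format(bits, '016b')

print(float_to_bin16(117.0))  # 0101011101010000
```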

Quickly find the integer part of the base-2 logarithm

Posted by 試著忘記壹切 on 2019-12-10 15:31:02
Question: What is an efficient method to calculate the integer part of the base-2 logarithm of a floating-point number? Something like N = ceil(log2(f)) or N = floor(log2(f)) for floating-point f. I guess this can be realized very efficiently somehow, as one probably only needs access to the floating-point exponent.

EDIT2: I am not primarily interested in exactness. I could tolerate an error of ±1. I listed the two variants just as an example, because one might be computationally cheaper than the other.
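One way to do this via the exponent alone, sketched in Python: math.frexp hands back the stored exponent directly, so both variants reduce to integer arithmetic with no logarithm at all (the function names are mine):

```python
import math

def floor_log2(f):
    """floor(log2(f)) for positive finite f, read off the stored exponent."""
    # frexp(f) returns (m, e) with f = m * 2**e and 0.5 <= m < 1,
    # so floor(log2(f)) == e - 1 exactly, with no rounding error.
    m, e = math.frexp(f)
    return e - 1

def ceil_log2(f):
    """ceil(log2(f)) for positive finite f."""
    m, e = math.frexp(f)
    return e - 1 if m == 0.5 else e  # exact power of two iff m == 0.5

print(floor_log2(1024.0))   # 10
print(floor_log2(1023.9))   # 9
print(ceil_log2(1023.9))    # 10
print(floor_log2(2**-130))  # -130 (works for subnormal doubles too)
```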

To what extent do IEEE 754 floats satisfy LessThanComparable?

Posted by ◇◆丶佛笑我妖孽 on 2019-12-10 14:35:40
Question: TL;DR: Do IEEE 754 floating-point values, including NaN, satisfy LessThanComparable?

Specifically, the question "Why does Release/Debug have a different result for std::min?" got me looking up LessThanComparable:

> The type must work with the < operator and the result should have standard semantics.
>
> Requirements: The type T satisfies LessThanComparable if, given a, b, and c, expressions of type T or const T, the following expression must be valid and have its specified effect: a < b establishes a strict weak ordering
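NaN is the problem case: every ordered comparison involving NaN is false, which breaks the strict-weak-ordering requirement that "neither a < b nor b < a" be a transitive equivalence. A quick Python demonstration of the same IEEE 754 comparison semantics:

```python
nan = float('nan')

# Ordered comparisons with NaN are all false:
print(nan < 1.0, 1.0 < nan, nan < nan)      # False False False

# Under a strict weak ordering, incomparability must be transitive.
# Here 1.0 ~ nan and nan ~ 2.0, yet 1.0 < 2.0 -- transitivity fails:
print(not (1.0 < nan) and not (nan < 1.0))  # True: 1.0 "equivalent" to nan
print(not (2.0 < nan) and not (nan < 2.0))  # True: nan "equivalent" to 2.0
print(1.0 < 2.0)                            # True: contradiction

# Practical consequence: sorting with NaN present gives unspecified order.
print(sorted([3.0, nan, 1.0, 2.0]))  # e.g. [3.0, nan, 1.0, 2.0] (unsorted!)
```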

Why does console.log show only part of the number resulting from 0.1+0.2 = 0.30000000000000004?

Posted by ≯℡__Kan透↙ on 2019-12-10 14:16:28
Question: This question wasn't asked on Stack Overflow yet! I'm not asking why 0.1 + 0.2 doesn't equal 0.3; I'm asking a very different thing! Please read the question before marking it as a duplicate.

I've written this function that shows how JavaScript stores float numbers in 64 bits:

```javascript
function to64bitFloat(number) {
    var f = new Float64Array(1);
    f[0] = number;
    var view = new Uint8Array(f.buffer);
    var i, result = "";
    for (i = view.length - 1; i >= 0; i--) {
        var bits = view[i].toString(2);
        if (bits.length < 8) {
            bits = new Array(8 - bits.length).fill('0').join("") + bits;
        }
        result += bits;
    }
    return result;
}
```
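The display question itself comes down to the shortest round-trip rule: console.log (like Python's repr) prints the shortest decimal string that parses back to exactly the same 64-bit double, not the full exact value stored in the bits. A Python sketch of the distinction (Python uses the same IEEE 754 binary64 as JavaScript's Number):

```python
from decimal import Decimal

x = 0.1 + 0.2
# Shortest string that round-trips to the same double -- what repr (and
# console.log) show:
print(repr(x))     # 0.30000000000000004
# The exact value actually stored in the 64 bits:
print(Decimal(x))  # 0.3000000000000000444089209850062616169452667236328125
```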

Interpreting a 32-bit unsigned long as a Single Precision IEEE-754 Float in C

Posted by 試著忘記壹切 on 2019-12-10 12:46:50
Question: I am using the XC32 compiler from Microchip, which is based on the standard C compiler. I am reading a 32-bit value from a device on an RS485 network and storing it in an unsigned long that I have typedef'ed as DWORD, i.e.

```c
typedef unsigned long DWORD;
```

As it stands, when I typecast this value to a float, the value I get is basically the floating-point version of its integer representation, not the proper IEEE-754 interpretation of the bits, i.e.

```c
DWORD dword_value = readValueOnRS485();
float temp = (float)dword_value;   /* numeric conversion, not reinterpretation */
```
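The usual fix in C is to reinterpret the bytes rather than convert the value (e.g. via memcpy or a union, since a cast performs numeric conversion). The same distinction sketched in Python, where struct makes the two operations explicit (the bit pattern below is my own example value, not from the question):

```python
import struct

dword_value = 0x42F60000  # example 32-bit pattern read from the device

# Value conversion -- what the C cast (float)dword_value does:
print(float(dword_value))                # 1123418112.0

# Bit reinterpretation -- what the asker wants:
(as_float,) = struct.unpack('<f', struct.pack('<I', dword_value))
print(as_float)                          # 123.0
```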

Python: unpack IBM 32-bit floating point

Posted by ◇◆丶佛笑我妖孽 on 2019-12-10 12:34:53
Question: I was reading a binary file in Python like this:

```python
from struct import unpack

ns = 1000
f = open("binary_file", 'rb')
while True:
    data = f.read(ns * 4)
    if data == '':
        break
    unpacked = unpack(">%sf" % ns, data)
    print str(unpacked)
```

when I realized that unpack(">f", str) is for unpacking IEEE floating point, while my data is IBM 32-bit floating-point numbers. My question is: how can I implement my own unpack for IBM 32-bit floating-point numbers? I don't mind using something like ctypes to extend Python to get better performance.
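For reference, IBM System/360 single precision is sign(1) | base-16 exponent(7, bias 64) | fraction(24), with value (-1)^s × 0.fraction × 16^(exp-64). A minimal pure-Python decoder sketch (the function name ibm32_to_float is mine):

```python
import struct

def ibm32_to_float(b):
    """Decode 4 big-endian bytes as an IBM System/360 single-precision float."""
    (bits,) = struct.unpack('>I', b)
    sign = -1.0 if bits >> 31 else 1.0
    exponent = (bits >> 24) & 0x7F               # base-16 exponent, bias 64
    fraction = (bits & 0x00FFFFFF) / 16777216.0  # 24 fraction bits / 2**24
    return sign * fraction * 16.0 ** (exponent - 64)

# Classic documentation example: 0xC276A000 encodes -118.625.
print(ibm32_to_float(b'\xc2\x76\xa0\x00'))  # -118.625
```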

Why does this loop never end? [duplicate]

Posted by 纵饮孤独 on 2019-12-10 12:29:10
Question: This question already has answers here (closed 9 years ago). Possible duplicate: "problem in comparing double values in C#"

I've read about it elsewhere but forget the answer, so I'm asking here again. This loop seems to never end regardless of the language you code it in (I tested it in C#, C++, Java, ...):

```
double d = 2.0;
while (d != 0.0) {
    d = d - 0.2;
}
```

Answer 1: Floating-point calculations are not perfectly precise. You will get a representation error because 0.2 doesn't have an exact representation as a binary floating-point number.
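You can watch the drift directly; a small Python sketch (Python uses the same IEEE 754 doubles as C#'s double):

```python
# The double closest to 0.2 is slightly larger than 1/5, and the rounding
# errors accumulate, so ten subtractions from 2.0 never land exactly on 0.0.
d = 2.0
for _ in range(10):
    d -= 0.2
print(repr(d))   # 2.7755575615628914e-16 on an IEEE 754 machine, not 0.0
print(d != 0.0)  # True -> the original while loop keeps running forever
```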