ieee-754

IEEE-754 32-bit (single precision) exponent -126 instead of -127

Posted by 橙三吉。 on 2019-12-10 22:38:27
Question: I know that if I have a number like this:

```
1 | 1001 0001 | 0011 0011 0000 0001 0101 000
sign bit | 8-bit biased exponent | 23-bit fraction/mantissa
```

I can calculate the "real" exponent by subtracting the bias 127 (0111 1111) from the biased exponent, i.e.

```
1001 0001 - 0111 1111 = 10010   (so the real exponent is 18)
1.0011 0011 0000 0001 0101 000 * 2^18
```

So now my question: if I have a (denormalized) number like this:

```
0 | 0000 0000 | 0000 0000 0000 0000 0000 001
```

why is the exponent -126 and not -127?
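The short answer: a biased exponent of all zeros is a special case. The exponent is fixed at 1 - 127 = -126 and the implicit leading 1 is dropped, so the largest subnormal (0.111...1 × 2^-126) meets the smallest normal (1.0 × 2^-126) without a gap. A minimal Python sketch of both decoding rules (the function name decode_bits is mine, for illustration; infinities/NaN are not handled):

```python
import struct

def decode_bits(bits32):
    """Decode a 32-bit IEEE-754 pattern by hand and cross-check with struct."""
    sign = bits32 >> 31
    biased_exp = (bits32 >> 23) & 0xFF
    frac = bits32 & 0x7FFFFF
    if biased_exp == 0:
        # Subnormal: exponent fixed at 1 - 127 = -126, no implicit leading 1.
        value = (-1) ** sign * (frac / 2.0 ** 23) * 2.0 ** -126
    else:
        # Normal: subtract the bias 127 and add the implicit leading 1.
        value = (-1) ** sign * (1 + frac / 2.0 ** 23) * 2.0 ** (biased_exp - 127)
    # Cross-check against the platform's native float32 decoding.
    native, = struct.unpack('>f', struct.pack('>I', bits32))
    assert value == native
    return value

# Smallest positive subnormal: 0 | 00000000 | 000...001
print(decode_bits(0x00000001))  # 1.401298464324817e-45 == 2**-149
```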

Positive/Negative Infinity Constants in Fortran

Posted by 孤街浪徒 on 2019-12-10 21:04:34
Question: How could I get constants (or parameters, I suppose) that are negative and positive infinity in Fortran 2008? I tried the following code:

```fortran
program inf
    use, intrinsic :: ieee_arithmetic
    real(8), parameter :: inf_pos = ieee_value(0d0, ieee_positive_inf)
    real(8), parameter :: inf_neg = ieee_value(0d0, ieee_negative_inf)
end program inf
```

However, I get the following errors:

```
$ gfortran inf.f08
inf.f08:4:22:

     real(8) :: inf_pos = ieee_value(0d0, ieee_positive_inf)
                      1
Error: Function 'ieee_value' in
```
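For reference, the ±infinity values that ieee_value produces are simply the IEEE-754 patterns with an all-ones exponent and a zero fraction. A quick Python sketch showing the binary64 encodings (Python here only because it makes the bytes easy to inspect):

```python
import math
import struct

# IEEE-754 binary64: sign(1) | exponent(11) | fraction(52).
# Infinity = exponent all ones, fraction zero.
bits_pos, = struct.unpack('>Q', struct.pack('>d', math.inf))
bits_neg, = struct.unpack('>Q', struct.pack('>d', -math.inf))
print(hex(bits_pos))  # 0x7ff0000000000000
print(hex(bits_neg))  # 0xfff0000000000000
```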

How does MySQL do the math for floating-point addition?

Posted by 给你一囗甜甜゛ on 2019-12-10 20:59:38
Question: I tested SELECT 0.1 + 0.2; in MySQL (MariaDB), and it returned the right answer:

```
MariaDB [(none)]> SELECT 0.1 + 0.2;
+-----------+
| 0.1 + 0.2 |
+-----------+
|       0.3 |
+-----------+
1 row in set (0.000 sec)
```

Floating-point calculation is inaccurate in most programming languages because of IEEE 754, as explained here. How does MySQL do the floating-point calculation so that it returns the right answer?

Answer 1: I know SQL-92 is an old standard, but I'm pretty sure this is not changed in
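The likely explanation, consistent with the SQL standard the answer cites: literals like 0.1 and 0.2 that contain a decimal point are exact-numeric (DECIMAL) values in SQL, so the addition is carried out in decimal arithmetic, not in IEEE 754 binary. Python's decimal module shows the same contrast:

```python
from decimal import Decimal

# Binary floating point (IEEE 754 double): inexact.
print(0.1 + 0.2)                        # 0.30000000000000004

# Decimal arithmetic, as SQL exact-numeric literals behave: exact.
print(Decimal('0.1') + Decimal('0.2'))  # 0.3
```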

Decimal to binary Half-Precision IEEE 754 in Python

Posted by 烈酒焚心 on 2019-12-10 19:17:21
Question: I was only able to convert a decimal into binary single-precision IEEE 754 using struct.pack, or do the opposite (float16 or float32) using numpy.frombuffer. Is it possible to convert a decimal to binary half-precision floating point using NumPy? I need to print the result of the conversion, so if I type "117.0", it should print "0101011101010000".

Answer 1:

> if I type "117.0", it should print "0101011101010000"

```python
>>> import numpy as np
>>> bin(np.float16(117.0).view('H'))[2:].zfill(16)
'0101011101010000'
```
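For reference, the standard library can do the same without NumPy: since Python 3.6, struct supports the 'e' format for IEEE 754 half precision. A minimal sketch (the helper name float_to_bin16 is mine):

```python
import struct

def float_to_bin16(x):
    """Pack x as IEEE 754 binary16 (big-endian) and format the 16 bits."""
    (bits,) = struct.unpack('>H', struct.pack('>e', x))
    return format(bits, '016b')

print(float_to_bin16(117.0))  # 0101011101010000
```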

Quickly find the integer part of the base-2 logarithm

Posted by 試著忘記壹切 on 2019-12-10 15:31:02
Question: What is an efficient method to calculate the integer part of the base-2 logarithm of a floating-point number? Something like N = ceil(log2(f)) or N = floor(log2(f)) for floating-point f. I guess this can be realized very efficiently somehow, as one probably only needs access to the floating-point exponent.

EDIT2: I am not primarily interested in exactness. I could tolerate an error of ±1. I listed the two variants just as an example, because one might be computationally cheaper than the other.
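One way to do this via the exponent alone, sketched in Python: math.frexp hands back the stored exponent directly, so both variants reduce to integer arithmetic with no logarithm at all (the function names are mine):

```python
import math

def floor_log2(f):
    """floor(log2(f)) for positive finite f, read off the stored exponent."""
    # frexp(f) returns (m, e) with f = m * 2**e and 0.5 <= m < 1,
    # so floor(log2(f)) == e - 1 exactly, with no rounding error.
    m, e = math.frexp(f)
    return e - 1

def ceil_log2(f):
    """ceil(log2(f)) for positive finite f."""
    m, e = math.frexp(f)
    return e - 1 if m == 0.5 else e  # exact power of two iff m == 0.5

print(floor_log2(1024.0))   # 10
print(floor_log2(1023.9))   # 9
print(ceil_log2(1023.9))    # 10
print(floor_log2(2**-130))  # -130 (works for subnormal doubles too)
```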

To what extent do IEEE 754 floats satisfy LessThanComparable?

Posted by ◇◆丶佛笑我妖孽 on 2019-12-10 14:35:40
Question: TL;DR: Do IEEE 754 floating-point values, including NaN, satisfy LessThanComparable?

Specifically, the question "Why does Release/Debug have a different result for std::min?" got me looking up LessThanComparable:

> The type must work with the < operator and the result should have standard semantics.
>
> Requirements: The type T satisfies LessThanComparable if, given a, b, and c, expressions of type T or const T, the following expression must be valid and have its specified effect: a < b establishes a strict weak ordering
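NaN is the problem case: every ordered comparison involving NaN is false, which breaks the strict-weak-ordering requirement that "neither a < b nor b < a" be a transitive equivalence. A quick Python demonstration of the same IEEE 754 comparison semantics:

```python
nan = float('nan')

# Ordered comparisons with NaN are all false:
print(nan < 1.0, 1.0 < nan, nan < nan)      # False False False

# Under a strict weak ordering, incomparability must be transitive.
# Here 1.0 ~ nan and nan ~ 2.0, yet 1.0 < 2.0 -- transitivity fails:
print(not (1.0 < nan) and not (nan < 1.0))  # True: 1.0 "equivalent" to nan
print(not (2.0 < nan) and not (nan < 2.0))  # True: nan "equivalent" to 2.0
print(1.0 < 2.0)                            # True: contradiction

# Practical consequence: sorting with NaN present gives unspecified order.
print(sorted([3.0, nan, 1.0, 2.0]))  # e.g. [3.0, nan, 1.0, 2.0] (unsorted!)
```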

Why does console.log show only part of the number resulting from 0.1+0.2 = 0.30000000000000004?

Posted by ≯℡__Kan透↙ on 2019-12-10 14:16:28
Question: This question wasn't asked on Stack Overflow yet! I'm not asking why 0.1 + 0.2 doesn't equal 0.3; I'm asking a very different thing! Please read the question before marking it as a duplicate.

I've written this function that shows how JavaScript stores float numbers in 64 bits:

```javascript
function to64bitFloat(number) {
    var f = new Float64Array(1);
    f[0] = number;
    var view = new Uint8Array(f.buffer);
    var i, result = "";
    for (i = view.length - 1; i >= 0; i--) {
        var bits = view[i].toString(2);
        if (bits.length < 8) {
            bits = new Array(8 - bits.length).fill('0').join("") + bits;
        }
        result += bits;
    }
    return result;
}
```
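The display question itself comes down to the shortest round-trip rule: console.log (like Python's repr) prints the shortest decimal string that parses back to exactly the same 64-bit double, not the full exact value stored in the bits. A Python sketch of the distinction (Python uses the same IEEE 754 binary64 as JavaScript's Number):

```python
from decimal import Decimal

x = 0.1 + 0.2
# Shortest string that round-trips to the same double -- what repr (and
# console.log) show:
print(repr(x))     # 0.30000000000000004
# The exact value actually stored in the 64 bits:
print(Decimal(x))  # 0.3000000000000000444089209850062616169452667236328125
```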

Interpreting a 32-bit unsigned long as a Single Precision IEEE-754 Float in C

Posted by 試著忘記壹切 on 2019-12-10 12:46:50
Question: I am using the XC32 compiler from Microchip, which is based on the standard C compiler. I am reading a 32-bit value from a device on an RS485 network and storing it in an unsigned long that I have typedef'ed as DWORD, i.e.

```c
typedef unsigned long DWORD;
```

As it stands, when I typecast this value to a float, the value I get is basically the floating-point version of its integer representation, not the proper IEEE-754 interpretation of the bits, i.e.

```c
DWORD dword_value = readValueOnRS485();
float temp = (float)dword_value;   /* numeric conversion, not reinterpretation */
```
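The usual fix in C is to reinterpret the bytes rather than convert the value (e.g. via memcpy or a union, since a cast performs numeric conversion). The same distinction sketched in Python, where struct makes the two operations explicit (the bit pattern below is my own example value, not from the question):

```python
import struct

dword_value = 0x42F60000  # example 32-bit pattern read from the device

# Value conversion -- what the C cast (float)dword_value does:
print(float(dword_value))                # 1123418112.0

# Bit reinterpretation -- what the asker wants:
(as_float,) = struct.unpack('<f', struct.pack('<I', dword_value))
print(as_float)                          # 123.0
```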

Python: unpack IBM 32-bit floating point

Posted by ◇◆丶佛笑我妖孽 on 2019-12-10 12:34:53
Question: I was reading a binary file in Python like this:

```python
from struct import unpack

ns = 1000
f = open("binary_file", 'rb')
while True:
    data = f.read(ns * 4)
    if data == '':
        break
    unpacked = unpack(">%sf" % ns, data)
    print str(unpacked)
```

when I realized that unpack(">f", str) is for unpacking IEEE floating point, while my data is IBM 32-bit floating-point numbers. My question is: how can I implement my own unpack for IBM 32-bit floating-point numbers? I don't mind using something like ctypes to extend Python to get better performance.
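For reference, IBM System/360 single precision is sign(1) | base-16 exponent(7, bias 64) | fraction(24), with value (-1)^s × 0.fraction × 16^(exp-64). A minimal pure-Python decoder sketch (the function name ibm32_to_float is mine):

```python
import struct

def ibm32_to_float(b):
    """Decode 4 big-endian bytes as an IBM System/360 single-precision float."""
    (bits,) = struct.unpack('>I', b)
    sign = -1.0 if bits >> 31 else 1.0
    exponent = (bits >> 24) & 0x7F               # base-16 exponent, bias 64
    fraction = (bits & 0x00FFFFFF) / 16777216.0  # 24 fraction bits / 2**24
    return sign * fraction * 16.0 ** (exponent - 64)

# Classic documentation example: 0xC276A000 encodes -118.625.
print(ibm32_to_float(b'\xc2\x76\xa0\x00'))  # -118.625
```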

Why does this loop never end? [duplicate]

Posted by 纵饮孤独 on 2019-12-10 12:29:10
Question: This question already has answers here (closed 9 years ago). Possible duplicate: "problem in comparing double values in C#"

I've read about it elsewhere but forget the answer, so I'm asking here again. This loop seems to never end regardless of the language you code it in (I tested it in C#, C++, Java, ...):

```
double d = 2.0;
while (d != 0.0) {
    d = d - 0.2;
}
```

Answer 1: Floating-point calculations are not perfectly precise. You will get a representation error because 0.2 doesn't have an exact representation as a binary floating-point number.
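You can watch the drift directly; a small Python sketch (Python uses the same IEEE 754 doubles as C#'s double):

```python
# The double closest to 0.2 is slightly larger than 1/5, and the rounding
# errors accumulate, so ten subtractions from 2.0 never land exactly on 0.0.
d = 2.0
for _ in range(10):
    d -= 0.2
print(repr(d))   # 2.7755575615628914e-16 on an IEEE 754 machine, not 0.0
print(d != 0.0)  # True -> the original while loop keeps running forever
```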