ieee-754

Lua - packing IEEE754 single-precision floating-point numbers

╄→гoц情女王★ submitted on 2019-11-30 09:08:36
I want to make a function in pure Lua that generates a fraction (23 bits), an exponent (8 bits), and a sign (1 bit) from a number, so that the number is approximately equal to math.ldexp(fraction, exponent - 127) * (sign == 1 and -1 or 1), and then packs the generated values into 32 bits. A function in the math library caught my attention: math.frexp breaks the floating-point value v into a mantissa m and an exponent n such that the absolute value of m is greater than or equal to 0.5 and less than 1.0, and v = m * 2^n. Note that math.ldexp is the inverse operation.
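A minimal sketch of that decomposition, written in Python for illustration rather than Lua (pack_float32 is my own helper name, and it covers normal numbers only; zeros, subnormals, infinities and NaNs would need extra cases). It derives the biased exponent and 23-bit fraction from math.frexp and checks the packed word against struct:

import math
import struct

def pack_float32(x):
    # Sign bit, biased 8-bit exponent and 23-bit fraction of a normal number.
    sign = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))                    # abs(x) == m * 2**e with 0.5 <= m < 1
    exponent = e + 126                           # bias is 127, but m is in [0.5, 1), not [1, 2)
    fraction = int((m * 2 - 1) * 2**23 + 0.5)    # drop the implicit leading 1, round
    return (sign << 31) | (exponent << 23) | fraction

# Cross-check against the native binary32 encoding.
x = 45.25
assert pack_float32(x) == struct.unpack('>I', struct.pack('>f', x))[0]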

Reading 32-bit signed IEEE 754 floating-point numbers from a binary file with Python?

Deadly submitted on 2019-11-30 06:47:05
I have a binary file which is simply a list of signed 32-bit IEEE 754 floating-point numbers. They are not separated by anything, and simply appear one after another until EOF. How would I read from this file and interpret them correctly as floating-point numbers? I tried using read(4), but it automatically converts them to a string with ASCII encoding. I also tried using bytearray, but that only takes the data in 1 byte at a time instead of 4 bytes at a time as I need. struct.unpack('f', file.read(4)) decodes one value; you can also unpack several at once, which will be faster: struct.unpack('f'*n, file.read(4*n))
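A small self-contained version of that approach, reading every value in the file at once; struct.iter_unpack takes care of stepping 4 bytes at a time, and '<f' assumes the file was written little-endian (an assumption about whoever wrote it; the file name is a placeholder):

import struct

with open('values.bin', 'rb') as fh:
    data = fh.read()

# '<f' = little-endian IEEE 754 binary32; use '>f' if the file is big-endian.
floats = [v for (v,) in struct.iter_unpack('<f', data)]
print(len(floats), floats[:5])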

Go float comparison [duplicate]

眉间皱痕 submitted on 2019-11-30 05:33:09
This question already has an answer here: Is floating point math broken? In order to compare two floats (float64) for equality in Go, my superficial understanding of IEEE 754 and the binary representation of floats makes me think that this is a good solution: func Equal(a, b float64) bool { ba := math.Float64bits(a) bb := math.Float64bits(b) diff := ba - bb if diff < 0 { diff = -diff } /* accept one bit difference */ return diff < 2 } The question is: is this a more generic, more precise, and more efficient way to compare two arbitrarily large or small floats for "almost equalness" than
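The usual way to make that idea precise is to map each float to an integer whose ordering matches the float ordering and compare the integer distance, i.e. count how many ULPs apart the two values are. A sketch of that mapping in Python (the same bit pattern math.Float64bits exposes in Go; ordered_bits and almost_equal are my own names, and real code would still need to decide what to do about NaNs):

import struct

def ordered_bits(x: float) -> int:
    # Reinterpret the double's 8 bytes as an unsigned 64-bit integer.
    (bits,) = struct.unpack('<Q', struct.pack('<d', x))
    # Remap negative floats so that integer order matches float order
    # (+0.0 and -0.0 both land on 2**63).
    return (1 << 64) - bits if bits >> 63 else bits + (1 << 63)

def almost_equal(a: float, b: float, max_ulps: int = 1) -> bool:
    return abs(ordered_bits(a) - ordered_bits(b)) <= max_ulps

print(almost_equal(0.1 + 0.2, 0.3))   # True: the two doubles are one ULP apart
print(almost_equal(1.0, 1.1))         # False

Note that the posted Go version subtracts two uint64 values, so the diff < 0 branch can never fire; the remapping above is what makes the subtraction meaningful across the sign boundary.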

How do I save a floating-point number in 2 bytes?

社会主义新天地 submitted on 2019-11-30 05:25:11
Question: Yes, I'm aware of the IEEE-754 half-precision standard, and yes, I'm aware of the work done in the field. Put very simply, I'm trying to save a simple floating-point number (like 52.1 or 1.25) in just 2 bytes. I've tried some implementations in Java and in C#, but they ruin the input value by decoding a different number. You feed in 32.1 and after encode-decode you get 32.0985. Is there ANY way I can store floating-point numbers in just 16 bits without ruining the input value? Thanks very
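For what it's worth, the behaviour described is inherent to any 16-bit format, not a bug in the implementations: IEEE 754 binary16 has only 10 fraction bits, so near 32 the representable values are 1/32 apart. Python's struct module exposes the half-precision format directly (format code 'e', available since Python 3.6), which makes the effect easy to reproduce:

import struct

def roundtrip_half(x: float) -> float:
    # Encode to IEEE 754 binary16 and decode again.
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(roundtrip_half(1.25))   # 1.25 survives exactly (it needs few fraction bits)
print(roundtrip_half(52.1))   # comes back slightly off
print(roundtrip_half(32.1))   # about 32.094, the nearest binary16 value to 32.1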

How cross-platform is Google's Protocol Buffer's handling of floating-point types in practice?

情到浓时终转凉″ submitted on 2019-11-30 04:42:25
Google's Protocol Buffers allows you to store floats and doubles in messages. I looked through the implementation source code wondering how they managed to do this in a cross-platform manner, and what I stumbled upon was: inline uint32 WireFormatLite::EncodeFloat(float value) { union {float f; uint32 i;}; f = value; return i; } inline float WireFormatLite::DecodeFloat(uint32 value) { union {float f; uint32 i;}; i = value; return f; } inline uint64 WireFormatLite::EncodeDouble(double value) { union {double f; uint64 i;}; f = value; return i; } inline double WireFormatLite::DecodeDouble(uint64
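The unions simply reinterpret the same bytes as either a floating-point value or a fixed-width integer; the cross-platform part rests on every supported platform agreeing on the IEEE 754 encoding of float and double. The same reinterpretation written in Python with struct, for comparison (encode_float/decode_float are my own names, mirroring the C++ above):

import struct

def encode_float(value: float) -> int:
    # View the float's 4 bytes as a uint32, like the union in WireFormatLite.
    return struct.unpack('<I', struct.pack('<f', value))[0]

def decode_float(value: int) -> float:
    return struct.unpack('<f', struct.pack('<I', value))[0]

bits = encode_float(45.25)
print(hex(bits))            # 0x42350000
print(decode_float(bits))   # 45.25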

In binary notation, what is the meaning of the digits after the radix point “.”?

若如初见. submitted on 2019-11-30 03:07:00
I have this example of how to convert from a base-10 number to IEEE 754 float representation:
Number: 45.25 (base 10) = 101101.01 (base 2)
Sign: 0
Normalized form: N = 1.0110101 * 2^5
Exponent: e = 5, so E = 5 + 127 = 132 (base 10) = 10000100 (base 2)
IEEE 754: 0 10000100 01101010000000000000000
This makes sense to me except one passage: 45.25 (base 10) = 101101.01 (base 2). 45 is 101101 in binary and that's okay, but how did they obtain the 0.25 as .01? You can convert the part after the decimal point to another base by repeatedly multiplying by the new base (in this case the new base is 2), like
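The truncated sentence is describing the classic multiply-by-2 loop; a short sketch of it (frac_to_binary is my own helper name), which reproduces 0.25 → .01:

def frac_to_binary(frac: float, max_digits: int = 23) -> str:
    # Repeatedly multiply by 2; the integer part that pops out is the next binary digit.
    digits = []
    while frac and len(digits) < max_digits:
        frac *= 2
        digit = int(frac)
        frac -= digit
        digits.append(str(digit))
    return ''.join(digits)

print(frac_to_binary(0.25))   # '01': 0.25*2 = 0.5 -> digit 0, then 0.5*2 = 1.0 -> digit 1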

Why is pow(-infinity, positive non-integer) +infinity?

生来就可爱ヽ(ⅴ<●) submitted on 2019-11-30 01:19:25
Question: C99 Annex F (IEEE floating-point support) says this: pow(−∞, y) returns +∞ for y > 0 and not an odd integer. But, say, (−∞)^0.5 actually has the imaginary values ±∞i, not +∞. C99's own sqrt(−∞) returns a NaN and generates a domain error, as expected. Why then is pow required to return +∞? (Most other languages use the C library directly or, like Python in this case, copy the behaviour required of it by standards, so in practice this affects more than just C99.) Answer 1: For odd integer y, it
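The convention is easy to observe from a language that inherits the C library's behaviour. Assuming the interpreter's math.pow follows the C99 Annex F special cases (CPython's generally does), the sign of the result depends only on whether y is an odd integer:

import math

inf = float('inf')
print(math.pow(-inf, 0.5))    # inf   (y > 0 and not an odd integer)
print(math.pow(-inf, 3.0))    # -inf  (y > 0 and an odd integer)
print(math.pow(-inf, -2.0))   # 0.0   (y < 0 and not an odd integer)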

How does this float square root approximation work?

天大地大妈咪最大 submitted on 2019-11-29 22:05:13
I found a rather strange but working square root approximation for floats; I really don't get it. Can someone explain to me why this code works? float sqrt(float f) { const int result = 0x1fbb4000 + (*(int*)&f >> 1); return *(float*)&result; } I've tested it a bit and its output is off from std::sqrt() by about 1 to 3%. I know of Quake III's fast inverse square root and I guess it's something similar here (without the Newton iteration), but I'd really appreciate an explanation of how it works. (Note: I've tagged it both c and c++ since it's both valid-ish (see comments) C and C++ code.)
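The trick works because a float's bit pattern, read as an integer, is approximately a scaled and offset log2 of the value; halving that integer roughly halves the exponent, which is a square root, and the magic constant re-centres the result. A rendering of the same bit manipulation in Python (using struct for the reinterpretation instead of the pointer cast):

import math
import struct

def approx_sqrt(x: float) -> float:
    # Reinterpret the float's bits as an integer, shift, add the magic constant,
    # then reinterpret the bits back as a float.
    (bits,) = struct.unpack('<I', struct.pack('<f', x))
    result = (0x1fbb4000 + (bits >> 1)) & 0xffffffff
    return struct.unpack('<f', struct.pack('<I', result))[0]

for v in (4.0, 2.0, 100.0, 12345.6789):
    approx, exact = approx_sqrt(v), math.sqrt(v)
    print(v, approx, abs(approx - exact) / exact)   # relative error of a few percent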

float128 and double-double arithmetic

白昼怎懂夜的黑 submitted on 2019-11-29 18:14:58
I've seen on Wikipedia that one way to implement quad precision is to use double-double arithmetic, even if it's not exactly the same precision in terms of bits: https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format In this case, we use two doubles to store the value, so we perform two operations to compute the result, one for each double of the result. In this case, can we have round-off errors on each double, or is there a mechanism that avoids this? "In this case, we use two double to store the value. So we need to make two operations at each time." This is not how double-double
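The answer (truncated here) is pushing back on the "two independent operations" picture; the standard resolution is that double-double arithmetic is built from error-free transformations, where the low double captures exactly the rounding error of the operation on the high double rather than adding a second independent rounding error. Knuth's TwoSum is the basic building block; a sketch in Python, whose floats are IEEE binary64:

def two_sum(a: float, b: float):
    # Error-free transformation: s + err equals a + b exactly, with s = fl(a + b).
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    err = (a - a_virtual) + (b - b_virtual)
    return s, err

s, err = two_sum(0.1, 0.2)
print(s)     # 0.30000000000000004, the rounded double sum
print(err)   # about -2.8e-17, the part of the exact sum the double could not hold

A double-double addition chains such transformations and renormalizes the (high, low) pair, so the low word holds the error of the high word instead of a second rounding error.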

Fastest algorithm to identify the smallest and largest x that make the double-precision equation x + a == b true

五迷三道 submitted on 2019-11-29 17:54:31
Question: In the context of static analysis, I am interested in determining the values of x in the then-branch of the conditional below: double x; x = …; if (x + a == b) { … a and b can be assumed to be double-precision constants (generalizing to arbitrary expressions is the easiest part of the problem), and the compiler can be assumed to follow IEEE 754 strictly (FLT_EVAL_METHOD is 0). The rounding mode at run-time can be assumed to be to-nearest-even. If computing with rationals were cheap, it would
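Not the fast algorithm the question asks for, but a brute-force baseline that makes the problem concrete: the set of x with fl(x + a) == b is an interval, and its edges can be probed by walking with math.nextafter (Python 3.9+) from the obvious candidate x0 = b - a. This is only a sketch under those assumptions; the interval can span far too many ULPs to enumerate this way when a is much larger than x.

import math

def naive_bounds(a: float, b: float):
    # Walk outward from x0 = b - a while x + a still rounds to exactly b.
    x0 = b - a
    if x0 + a != b:
        return None   # give up: no solution at the obvious candidate
    lo = x0
    while math.nextafter(lo, -math.inf) + a == b:
        lo = math.nextafter(lo, -math.inf)
    hi = x0
    while math.nextafter(hi, math.inf) + a == b:
        hi = math.nextafter(hi, math.inf)
    return lo, hi

print(naive_bounds(1.0, 2.0))   # a tiny interval around 1.0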