ieee-754

How do I save a floating-point number in 2 bytes?

别来无恙 submitted on 2019-12-01 05:26:05
Yes, I'm aware of the IEEE-754 half-precision standard, and yes, I'm aware of the work done in the field. Put very simply, I'm trying to save a simple floating-point number (like 52.1 or 1.25) in just 2 bytes. I've tried some implementations in Java and in C#, but they ruin the input value by decoding a different number: you feed in 32.1 and after an encode-decode round trip you get 32.0985. Is there ANY way I can store floating-point numbers in just 16 bits without ruining the input value? Thanks very much.
You could store three digits in BCD and use the remaining four bits for the decimal point position:
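
The BCD suggestion above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `encode_bcd16` and `decode_bcd16` are hypothetical, the layout (4 bits of scale followed by three 4-bit digits) is one of several possible choices, and the sketch handles only non-negative values with at most three significant digits.

```python
from decimal import Decimal

def encode_bcd16(text: str) -> bytes:
    """Pack a non-negative decimal with at most 3 significant digits into 2 bytes.

    Assumed layout (hypothetical): top 4 bits = number of digits after the
    decimal point, low 12 bits = three BCD digits.
    """
    sign, digits, exponent = Decimal(text).as_tuple()
    if sign or len(digits) > 3 or not isinstance(exponent, int) or not -15 <= exponent <= 0:
        raise ValueError(f"cannot store {text!r} in this 16-bit layout")
    d = (0,) * (3 - len(digits)) + digits   # left-pad to exactly three digits
    scale = -exponent                        # digits after the decimal point, 0..15
    word = (scale << 12) | (d[0] << 8) | (d[1] << 4) | d[2]
    return word.to_bytes(2, "big")

def decode_bcd16(raw: bytes) -> Decimal:
    """Inverse of encode_bcd16: rebuild the exact decimal value."""
    word = int.from_bytes(raw, "big")
    scale = word >> 12
    value = ((word >> 8) & 0xF) * 100 + ((word >> 4) & 0xF) * 10 + (word & 0xF)
    return Decimal(value).scaleb(-scale)

for text in ("52.1", "1.25", "32.1"):
    raw = encode_bcd16(text)
    print(text, raw.hex(), decode_bcd16(raw))
```

Because the digits are stored in decimal rather than as a binary fraction, 32.1 comes back as exactly 32.1. The price is a very small range: three significant digits and no sign.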

Calculator to convert binary to float value — what am I doing wrong?

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-01 05:16:10
I have the following code, which writes 6 floats to disk in binary form and reads them back:
    #include <iostream>
    #include <cstdio>
    int main() {
        int numSegs = 2;
        int numVars = 3;
        float * data = new float[numSegs * numVars];
        for (int i = 0; i < numVars * numSegs; ++i) {
            data[i] = i * .23;
            std::cout << data[i] << std::endl;
        }
        FILE * handle = std::fopen("./sandbox.out", "wb");
        long elementsWritten = std::fwrite(data, sizeof(float), numVars*numSegs, handle);
        if (elementsWritten != numVars*numSegs){
            std::cout << "Error" << std::endl;
        }
        fclose(handle);
        handle = fopen("./sandbox.out", "rb");
        float *
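
When checking such a file by hand, it can help to see the raw bit patterns next to the decoded values. The following Python sketch is an illustration, not the asker's code: it keeps the bytes in memory instead of writing `sandbox.out`, assumes little-endian byte order, and `decode_bits` is a hypothetical helper that applies the binary32 formula manually.

```python
import math
import struct

def decode_bits(bits: int) -> float:
    """Decode a 32-bit pattern by hand using the IEEE 754 binary32 layout."""
    sign = -1.0 if bits >> 31 else 1.0
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:                       # reserved: infinity or NaN
        return sign * math.inf if fraction == 0 else math.nan
    if exponent == 0:                          # zero or subnormal: no implicit leading 1
        return sign * fraction * 2.0 ** -149
    return sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)

values = [i * 0.23 for i in range(6)]          # same data as the C++ loop
raw = struct.pack("<6f", *values)              # six little-endian float32 values (24 bytes)
patterns = struct.unpack("<6I", raw)           # the same bytes viewed as 32-bit integers
decoded = struct.unpack("<6f", raw)            # the same bytes decoded as floats

for bits, value in zip(patterns, decoded):
    print(f"{bits:032b} -> {value!r} (manual decode: {decode_bits(bits)!r})")
```

If a hand calculation disagrees with the program, reading the bytes in the wrong order or forgetting the implicit leading 1 are two things worth ruling out first.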

How to avoid less precise sum for numpy-arrays with multiple columns

折月煮酒 submitted on 2019-12-01 04:55:29
Question: I've always assumed that NumPy uses a kind of pairwise summation, which ensures high precision even for float32 operations:
    import numpy as np
    N = 17 * 10**6  # float32 precision no longer enough to hold the whole sum
    print(np.ones((N, 1), dtype=np.float32).sum(axis=0))  # [17000000.], kind of expected
However, it looks as if a different algorithm is used if the matrix has more than one column:
    print(np.ones((N, 2), dtype=np.float32).sum(axis=0))  # [16777216. 16777216.] the error is just too big
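
The value 16777216 is 2**24, the point at which adding 1.0 to a float32 accumulator stops changing it. The sketch below reproduces that saturation in plain Python; it emulates float32 rounding with `struct` and does not call NumPy. The helper `f32` is a hypothetical name, and the starting value is chosen only to keep the loop short.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

start = float(2 ** 24 - 4)        # 16777212.0, exactly representable in float32

naive = f32(start)
for _ in range(10):
    naive = f32(naive + 1.0)      # round after every addition, like a float32 accumulator

wide = start
for _ in range(10):
    wide += 1.0                   # accumulate in binary64 instead
wide = f32(wide)                  # round to float32 once, at the end

print(naive)  # 16777216.0
print(wide)   # 16777222.0
```

On the NumPy side, one way to request a wider accumulator is the `dtype` argument of `sum`, for example `np.ones((N, 2), dtype=np.float32).sum(axis=0, dtype=np.float64)`.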

Parsing a double from a string which holds a value greater than Double.MaxValue

浪子不回头ぞ submitted on 2019-12-01 04:42:44
Question: Consider the following Java code:
    String toParse = "1.7976931348623157E308"; // max value of a double in Java
    double parsed = Double.parseDouble(toParse);
    System.out.println(parsed);
For the value 1.7976931348623157E308 everything makes sense and one gets the correct output. Now, if one tries to parse 1.7976931348623158E308 (last digit before E incremented), one still gets the maximum value printed to the console! Only after trying to parse 1.7976931348623159E308 (again the last
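
The behaviour described follows from round-to-nearest: a decimal string is mapped to the closest double, and only strings at or beyond the midpoint between the largest double and 2**1024 overflow. The Python sketch below computes that midpoint exactly. It uses Python's `float()` as a stand-in for `Double.parseDouble`, which is an assumption about equivalence, although the outputs reported in the question match.

```python
import sys
from fractions import Fraction

largest = sys.float_info.max                  # 1.7976931348623157e308
half_ulp = Fraction(2) ** 970                 # half the spacing of doubles just below 2**1024
midpoint = Fraction(largest) + half_ulp       # strings below this round down to `largest`

for text in ("1.7976931348623157e308",
             "1.7976931348623158e308",
             "1.7976931348623159e308"):
    below = Fraction(text) < midpoint         # exact rational comparison
    print(text, "below midpoint:", below, "parsed:", float(text))
```

The second string lies below the midpoint and therefore still parses to the maximum value; the third lies above it and parses to infinity.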

max float represented in IEEE 754

妖精的绣舞 submitted on 2019-12-01 04:25:36
I am wondering if the max float represented in IEEE 754 is (1.11111111111111111111111)_b*2^[(11111111)_b-127]. Here _b means binary representation. But that value is 3.403201383*10^38, which is different from 3.402823669*10^38, which is (1.0)_b*2^[(11111111)_b-127] and is given by, for example, C++ <limits>. Isn't (1.11111111111111111111111)_b*2^[(11111111)_b-127] representable and larger in the framework? Does anybody know why? Thank you.
The exponent (11111111)_b is reserved for infinities and NaNs, so your number cannot be represented. The greatest value that can be represented in single
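
The reserved exponent can be checked directly by decoding a few bit patterns. In this Python sketch, `float32_from_bits` is a hypothetical helper name; the largest finite pattern has exponent field 11111110 and an all-ones fraction, which gives (2 - 2^-23) * 2^127.

```python
import struct

def float32_from_bits(bits: int) -> float:
    """Interpret a 32-bit integer as an IEEE 754 binary32 value."""
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]

largest_finite = float32_from_bits(0x7F7FFFFF)   # exponent 11111110, fraction all ones
print(largest_finite)                             # 3.4028234663852886e+38
print(largest_finite == (2 - 2 ** -23) * 2.0 ** 127)  # True

print(float32_from_bits(0x7F800000))              # inf: exponent 11111111, fraction zero
print(float32_from_bits(0x7FFFFFFF))              # nan: exponent 11111111, fraction non-zero
```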

IEEE double such that sqrt(x*x) ≠ x

那年仲夏 submitted on 2019-12-01 03:00:26
Does there exist an IEEE double x > 0 such that sqrt(x*x) ≠ x, under the condition that the computation x*x does not overflow or underflow to Inf, 0, or a denormal number? This is given that sqrt returns the nearest representable result, and so does x*x (both as mandated by the IEEE standard: "square root operation be calculated as if in infinite precision, and then rounded to one of the two nearest floating-point numbers of the specified precision that surround the infinitely precise result"). Assuming that if such doubles exist, then there are probably examples close to 1,
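
A random search cannot settle the question, but it is a cheap way to look for counterexamples. The sketch below samples doubles in [1, 2], where x*x is far from overflow and underflow, and assumes that the platform's `math.sqrt` is correctly rounded. The seed and sample count are arbitrary choices.

```python
import math
import random

rng = random.Random(754)             # fixed seed so the run is repeatable
mismatches = []
for _ in range(200_000):
    x = rng.uniform(1.0, 2.0)        # x*x stays well inside the normal range
    if math.sqrt(x * x) != x:
        mismatches.append(x)

print("mismatches found:", len(mismatches))
```

An empty list is consistent with the claim that no such double exists, but it does not prove it.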

Calculator to convert binary to float value — what am I doing wrong?

蹲街弑〆低调 submitted on 2019-12-01 02:00:28
Question: I have the following code, which writes 6 floats to disk in binary form and reads them back:
    #include <iostream>
    #include <cstdio>
    int main() {
        int numSegs = 2;
        int numVars = 3;
        float * data = new float[numSegs * numVars];
        for (int i = 0; i < numVars * numSegs; ++i) {
            data[i] = i * .23;
            std::cout << data[i] << std::endl;
        }
        FILE * handle = std::fopen("./sandbox.out", "wb");
        long elementsWritten = std::fwrite(data, sizeof(float), numVars*numSegs, handle);
        if (elementsWritten != numVars*numSegs)

Encoding and decoding IEEE 754 floats in JavaScript

十年热恋 submitted on 2019-11-30 19:13:12
I need to encode and decode IEEE 754 floats and doubles from binary in Node.js to parse a network protocol. Are there any existing libraries that do this, or do I have to read the spec and implement it myself? Or should I write a C module to do it?
Dobes Vandermeer: Note that as of Node 0.6 this functionality is included in the core library, so that is now the best way to do it. See http://nodejs.org/docs/latest/api/buffer.html for details. If you are reading/writing binary data structures you might consider using a friendly wrapper around this functionality to make things easier to read and

Math.pow with negative numbers and non-integer powers

早过忘川 submitted on 2019-11-30 13:02:28
The ECMAScript specification for Math.pow has the following peculiar rule: If x < 0 and x is finite and y is finite and y is not an integer, the result is NaN. ( http://es5.github.com/#x15.8.2.13 ) As a result, Math.pow(-8, 1 / 3) gives NaN rather than -2. What is the reason for this rule? Is there some sort of broader computer science or IEEE-ish reason for this rule, or is it just a choice TC39/Eich made once upon a time?
Update: Thanks to Amadan's exchanges with me, I think I understand the reasoning now. I would like to expand upon our discussion for the sake of posterity. Let's take the
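
One line of reasoning can be illustrated numerically. The explanation quoted above is cut off, so this may not be the argument the author goes on to give. The exponent that `pow` actually receives is not one third but the nearest double, a fraction whose denominator is a power of two, and a negative base raised to such an exponent has no real value. The Python sketch below shows this; `real_cbrt` is a hypothetical helper for the case where a real cube root is what is wanted.

```python
import math
from fractions import Fraction

third = Fraction(1 / 3)               # exact value of the double nearest to one third
print(third.denominator == 2 ** 54)   # True: an even denominator, so not really 1/3

try:
    math.pow(-8, 1 / 3)               # Python's math.pow rejects this case too
except ValueError as exc:
    print("math.pow(-8, 1/3) ->", exc)

def real_cbrt(x: float) -> float:
    """Real cube root: take the root of the magnitude, then restore the sign."""
    return math.copysign(abs(x) ** (1 / 3), x)

print(real_cbrt(-8.0))
```

Python raises an exception where ECMAScript returns NaN, but both refuse to produce -2 from the floating-point exponent alone.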

Does double z=x-y guarantee that z+y==x for IEEE 754 floating point?

隐身守侯 submitted on 2019-11-30 09:23:30
Question: I have a problem that can be reduced to this problem statement: Given a series of doubles where each is in the range [0, 1e7], modify the last element such that the sum of the numbers equals exactly a target number. The series of doubles already sums to the target number within an epsilon (1e-7), but they are not ==. The following code is working, but is it guaranteed to work for all inputs that meet the requirements described in the first sentence? public static double[] FixIt(double[]
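
For the general question in the title, a single counterexample is enough to show that the identity is not guaranteed. The Python sketch below uses two values chosen for illustration; it says nothing about whether the restricted inputs described in the question can trigger the same effect. The second pair satisfies y/2 <= x <= 2y, a condition under which the subtraction is known to be exact (Sterbenz's lemma).

```python
# Case 1: x is far smaller than y, so x - y is rounded and information is lost.
x, y = 1e-17, 1.0
z = x - y               # rounds to -1.0, because x is below half an ulp of 1.0
print(z + y == x)       # False
print(z + y)            # 0.0

# Case 2: y/2 <= x <= 2*y, so x - y is computed exactly and the identity holds.
x, y = 0.3, 0.2
z = x - y
print(z + y == x)       # True
```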