ieee-754 | 易学教程

How do I save a floating-point number in 2 bytes?

Yes, I'm aware of the IEEE-754 half-precision standard, and yes, I'm aware of the work done in the field. Put very simply, I'm trying to save a simple floating-point number (like 52.1 or 1.25) in just 2 bytes. I've tried some implementations in Java and in C#, but they ruin the input value by decoding a different number: you feed in 32.1 and after encode-decode you get 32.0985. Is there ANY way I can store floating-point numbers in just 16 bits without ruining the input value? Thanks very much.

You could store three digits in BCD and use the remaining four bits for the decimal point position:
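
One way to read the BCD suggestion, as a minimal Python sketch: the three BCD digits go in the top 12 bits and the low 4 bits count the digits after the decimal point. The bit layout and the names `encode_bcd16` and `decode_bcd16` are assumptions made for illustration; they are not taken from the original answer.

```python
from decimal import Decimal

def encode_bcd16(text):
    """Pack a non-negative decimal with up to three digits into a 16-bit integer."""
    value = Decimal(text)
    if not value.is_finite():
        raise ValueError("finite values only")
    sign, digits, exponent = value.as_tuple()
    scale = -exponent  # number of digits after the decimal point
    if sign or len(digits) > 3 or not 0 <= scale <= 15:
        raise ValueError(f"{text!r} does not fit this 16-bit layout")
    # Left-pad with zeros so there are always exactly three digits.
    d0, d1, d2 = (0,) * (3 - len(digits)) + tuple(digits)
    # Assumed layout: [digit][digit][digit][scale], four bits each.
    return (d0 << 12) | (d1 << 8) | (d2 << 4) | scale

def decode_bcd16(word):
    """Recover the exact decimal value from the 16-bit word."""
    d0 = (word >> 12) & 0xF
    d1 = (word >> 8) & 0xF
    d2 = (word >> 4) & 0xF
    scale = word & 0xF
    return Decimal(d0 * 100 + d1 * 10 + d2).scaleb(-scale)

for text in ("52.1", "1.25", "32.1"):
    word = encode_bcd16(text)
    print(text, hex(word), decode_bcd16(word))
```

The round trip is exact because the value never passes through a binary fraction. The cost is range: this layout only holds non-negative values with at most three significant digits.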

Calculator to convert binary to float value — what am I doing wrong?

I have the following code, which writes 6 floats to disk in binary form and reads them back:

    #include <iostream>
    #include <cstdio>

    int main() {
        int numSegs = 2;
        int numVars = 3;
        float * data = new float[numSegs * numVars];
        for (int i = 0; i < numVars * numSegs; ++i) {
            data[i] = i * .23;
            std::cout << data[i] << std::endl;
        }
        FILE * handle = std::fopen("./sandbox.out", "wb");
        long elementsWritten = std::fwrite(data, sizeof(float), numVars*numSegs, handle);
        if (elementsWritten != numVars*numSegs){
            std::cout << "Error" << std::endl;
        }
        fclose(handle);
        handle = fopen("./sandbox.out", "rb");
        float *

How to avoid less precise sum for numpy-arrays with multiple columns

I've always assumed that numpy uses a kind of pairwise summation, which ensures high precision even for float32 operations:

    import numpy as np
    N = 17 * 10**6  # float32 precision is no longer enough to hold the whole sum
    print(np.ones((N, 1), dtype=np.float32).sum(axis=0))
    # [17000000.], kind of expected

However, it looks as if a different algorithm is used if the matrix has more than one column:

    print(np.ones((N, 2), dtype=np.float32).sum(axis=0))
    # [16777216. 16777216.] the error is just too big
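
The value 16777216 is 2^24, the point where float32 can no longer represent every integer. The sketch below emulates float32 rounding in pure Python with `struct` to show why adding ones one at a time stalls there, while adding a partial sum does not. It does not reproduce NumPy's internal summation; the helpers `to_f32` and `add_f32` are illustrative names.

```python
import struct

def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def add_f32(a, b):
    """Add two values and round the result to float32, as a float32 accumulator would."""
    return to_f32(a + b)

total = 2.0 ** 24  # 16777216.0

# Adding 1.0 eight times, one element at a time: every step rounds back down.
stalled = total
for _ in range(8):
    stalled = add_f32(stalled, 1.0)

# Summing the same eight ones into a block first, then adding the block once.
block = 0.0
for _ in range(8):
    block = add_f32(block, 1.0)
combined = add_f32(total, block)

print(stalled)   # 16777216.0
print(combined)  # 16777224.0
```

Combining partial sums of similar size is the idea behind pairwise summation. A running accumulator that adds one element at a time loses that benefit once the total reaches 2^24.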

Parsing a double from a string which holds a value greater than Double.MaxValue

Consider the following Java code:

    String toParse = "1.7976931348623157E308"; // max value of a double in Java
    double parsed = Double.parseDouble(toParse);
    System.out.println(parsed);

For the mentioned value of 1.7976931348623157E308 everything makes sense and one gets the correct output. Now, if one tries to parse 1.7976931348623158E308 (last digit before E incremented), you still get the maximum value printed to the console! Only after trying to parse 1.7976931348623159E308 (again the last
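
The behaviour described here is what round-to-nearest parsing produces: a decimal string only overflows once it reaches the halfway point between the largest double and 2^1024. The following sketch checks that with exact integer arithmetic, using Python's `float()` as a stand-in for a correctly rounding parser. It is an illustration of the arithmetic, not a description of how `Double.parseDouble` is implemented.

```python
import sys

# Exact integer values of the largest double and of the rounding threshold.
max_exact = 2**1024 - 2**971   # largest finite binary64 value
halfway = 2**1024 - 2**970     # midpoint between max_exact and 2**1024

def exact_value(digits17):
    """Exact integer value of d.dddddddddddddddd E308, given its 17 digits as an int."""
    return digits17 * 10**292

for digits in (17976931348623157, 17976931348623158, 17976931348623159):
    s = str(digits)
    text = f"{s[0]}.{s[1:]}E308"
    below = exact_value(digits) < halfway
    print(text, "below halfway:", below, "parsed:", float(text))
```

Inputs below the halfway point round down to the maximum value. Inputs at or above it round to infinity.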

max float represented in IEEE 754

I am wondering if the max float represented in IEEE 754 is: (1.11111111111111111111111)_b*2^[(11111111)_b-127]. Here _b means binary representation. But that value is 3.403201383*10^38, which is different from 3.402823669*10^38, which is (1.0)_b*2^[(11111111)_b-127] and given by, for example, C++ <limits>. Isn't (1.11111111111111111111111)_b*2^[(11111111)_b-127] representable and larger in the framework? Does anybody know why? Thank you.

The exponent (11111111)_b is reserved for infinities and NaNs, so your number cannot be represented. The greatest value that can be represented in single
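
The reserved exponent can be checked by decoding bit patterns directly. This is a small Python sketch using `struct`; the helper name `f32_from_bits` is made up for illustration. The largest finite single-precision value uses exponent field (11111110)_b with an all-ones fraction, which is bit pattern 0x7F7FFFFF.

```python
import math
import struct

def f32_from_bits(bits):
    """Interpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]

largest = f32_from_bits(0x7F7FFFFF)    # exponent 11111110, fraction all ones
infinity = f32_from_bits(0x7F800000)   # exponent 11111111, fraction zero
not_a_number = f32_from_bits(0x7FC00000)  # exponent 11111111, fraction nonzero

print(largest)                          # 3.4028234663852886e+38
print(largest == (2 - 2**-23) * 2.0**127)
print(infinity, not_a_number)
```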

IEEE double such that sqrt(x*x) ≠ x

Does there exist an IEEE double x>0 such that sqrt(x*x) ≠ x, under the condition that the computation x*x does not overflow or underflow to Inf, 0, or a denormal number? This is given that sqrt returns the nearest representable result, and so does x*x (both as mandated by the IEEE standard: "square root operation be calculated as if in infinite precision, and then rounded to one of the two nearest floating-point numbers of the specified precision that surround the infinitely precise result"). Under the assumption that if such doubles exist, then there are probably examples close to 1,
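
A quick way to probe the question empirically is a random search, sketched below in Python. It assumes `math.sqrt` and `*` are correctly rounded binary64 operations on the host platform, and it keeps the exponent small enough that x*x stays normal and finite. The function name `find_mismatches` is illustrative. A search like this can only turn up counterexamples; finding none is not a proof.

```python
import math
import random

def find_mismatches(samples, seed=0):
    """Return the sampled doubles x for which sqrt(x*x) != x."""
    rng = random.Random(seed)
    found = []
    for _ in range(samples):
        # Significand in [1, 2], exponent in [-400, 400], so x*x cannot
        # overflow, underflow, or become denormal.
        x = math.ldexp(rng.uniform(1.0, 2.0), rng.randint(-400, 400))
        if math.sqrt(x * x) != x:
            found.append(x)
    return found

mismatches = find_mismatches(100_000)
print(len(mismatches), "mismatches in 100000 samples")
```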

Encoding and decoding IEEE 754 floats in JavaScript

I need to encode and decode IEEE 754 floats and doubles from binary in node.js to parse a network protocol. Are there any existing libraries that do this, or do I have to read the spec and implement it myself? Or should I write a C module to do it?

Dobes Vandermeer

Note that as of node 0.6 this functionality is included in the core library, so that is the new best way to do it. See http://nodejs.org/docs/latest/api/buffer.html for details. If you are reading/writing binary data structures you might consider using a friendly wrapper around this functionality to make things easier to read and

Math.pow with negative numbers and non-integer powers

The ECMAScript specification for Math.pow has the following peculiar rule: If x < 0 and x is finite and y is finite and y is not an integer, the result is NaN. (http://es5.github.com/#x15.8.2.13) As a result, Math.pow(-8, 1 / 3) gives NaN rather than -2. What is the reason for this rule? Is there some sort of broader computer science or IEEEish reason for this rule, or is it just a choice TC39/Eich made once upon a time?

Update: Thanks to Amadan's exchanges with me, I think I understand the reasoning now. I would like to expand upon our discussion for the sake of posterity. Let's take the
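
One relevant fact, shown here in Python rather than JavaScript: the double closest to 1/3 is not exactly one third. Its exact value is a fraction with a power-of-two denominator, so the power function never sees a true cube root request. Whether this matches the reasoning in the original discussion is an assumption. Python's `math.pow` follows the same convention and rejects the call; `real_cbrt` is a hypothetical helper showing a common workaround when a real cube root is wanted.

```python
import math
from fractions import Fraction

third = 1 / 3
exact = Fraction(third)  # the exact rational value stored in the double

print(exact == Fraction(1, 3))        # False: 1/3 is not representable
print(exact.denominator % 2 == 0)     # True: the denominator is a power of two

try:
    math.pow(-8.0, third)
except ValueError as err:
    print("math.pow refuses a negative base with a fractional exponent:", err)

def real_cbrt(x):
    """Real cube root: take the root of |x| and restore the sign (illustrative helper)."""
    return math.copysign(abs(x) ** (1 / 3), x)

print(real_cbrt(-8.0))
```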

Does double z=x-y guarantee that z+y==x for IEEE 754 floating point?

I have a problem that can be reduced to this problem statement: Given a series of doubles where each is in the range [0, 1e7], modify the last element such that the sum of the numbers equals exactly a target number. The series of doubles already sums to the target number within an epsilon (1e-7), but they are not ==. The following code is working, but is it guaranteed to work for all inputs that meet the requirements described in the first sentence?

    public static double[] FixIt(double[]
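
For the general question in the title, a single counterexample is enough to show there is no such guarantee under round-to-nearest-even. The sketch below uses Python floats, which are IEEE 754 doubles on common platforms. Both values lie in [0, 1e7]; they were picked for illustration and do not come from the original post.

```python
x = 2.0 ** -54   # about 5.55e-17
y = 1.0

# The exact difference -(1 - 2**-54) lies halfway between two doubles
# and rounds to the even neighbour, -1.0.
z = x - y

print(z)            # -1.0
print(z + y)        # 0.0
print(z + y == x)   # False
```

The subtraction loses x entirely, so adding y back cannot recover it.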