floating-point | 易学教程

Instruction FYL2XP1

Question: I'm wondering why the FYL2XP1 instruction on the x86 architecture computes exactly the mathematical formula y · log_2(x + 1). What's special about this formula?

Answer 1: The y operand is usually a compile-time constant; for the moment, forget about the x + 1. Since log_b(x) = log_b(2) · log_2(x), the instruction makes it possible to compute the logarithm of x + 1 in any base. Note that log_b(2) is a constant, since it is seldom necessary to compute the logarithm with a degree of freedom in the base. FYL2XP1 and
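The identity in the answer can be tried out in portable C++ without touching the x87 instruction. The sketch below is only an illustration: `log_base_of_1_plus` is a hypothetical helper name, and `std::log1p` stands in for the hardware's log_2(x + 1) step. It does not claim to show how FYL2XP1 is implemented.

```cpp
#include <cmath>

// Hypothetical helper: computes log_b(1 + x) as y * log_2(1 + x),
// where y = log_b(2) plays the role of the constant "y" operand.
double log_base_of_1_plus(double x, double base) {
    const double y = 1.0 / std::log2(base);                 // log_b(2) = 1 / log_2(b)
    const double log2_1px = std::log1p(x) / std::log(2.0);  // log_2(1 + x), accurate for small x
    return y * log2_1px;
}
```

Because y depends only on the base, it can be computed once and reused for every x, which matches the answer's remark that y is usually a constant.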

Is floating point math broken?

Question: Consider the following code:

    0.1 + 0.2 == 0.3  ->  false
    0.1 + 0.2         ->  0.30000000000000004

Why do these inaccuracies happen?

Answer 1: Binary floating-point math is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented. For 0.1 in the standard
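The comparison from the question can be reproduced in C++, where `double` is an IEEE 754 binary64 value on most platforms. A minimal sketch follows; `sum_equals_exactly` and `nearly_equal` are hypothetical helper names, and the tolerance of four machine epsilons is an illustrative assumption, not a universal rule for comparing floating-point values.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Exact comparison: true only if the rounded sum equals `expected` exactly.
bool sum_equals_exactly(double a, double b, double expected) {
    return a + b == expected;
}

// Hypothetical helper: relative comparison with a small tolerance.
// The factor 4 is an arbitrary choice for this sketch.
bool nearly_equal(double a, double b) {
    const double eps = 4.0 * std::numeric_limits<double>::epsilon();
    return std::fabs(a - b) <= eps * std::max(std::fabs(a), std::fabs(b));
}
```

On a typical IEEE 754 platform, `sum_equals_exactly(0.1, 0.2, 0.3)` returns `false`, while `nearly_equal(0.1 + 0.2, 0.3)` returns `true`. A sum of values whose denominators are powers of two, such as 0.5 + 0.25, compares exactly equal to 0.75.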

atan2f gives different results with m32 flag

Question: I'm porting some code from 32-bit to 64-bit and ensuring the answers are the same. In doing so, I noticed that atan2f was giving different results between the two. I created this minimal repro:

    #include <stdio.h>
    #include <math.h>

    void testAtan2fIssue(float A, float B)
    {
        float atan2fResult = atan2f(A, B);
        printf("atan2f: %.15f\n", atan2fResult);
        float atan2Result = atan2(A, B);
        printf("atan2: %.15f\n", atan2Result);
    }

    int main()
    {
        float A = 16.323556900024414;
        float B = -5.843180656433105;

Full precision display of floating point numbers in C++?

Question: I have read several topics about displaying floating-point numbers in C++ and couldn't find a satisfying answer. My question is: how can I display all the significant digits of a floating-point number in C++ in scientific format (mantissa/exponent)? The problem is that not all numbers have the same number of significant digits in base 10. For example, a double has 15 to 17 significant decimal digits of precision, but std::numeric_limits<double>::digits10 returns 15 and
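One common approach to this kind of question is `std::numeric_limits<double>::max_digits10` (17 for an IEEE 754 double), the number of decimal digits that is always enough for a value to survive a round trip through text. The sketch below assumes that reading of the problem; `to_full_precision` is a hypothetical helper name, and this may not be the answer the original thread settled on.

```cpp
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Hypothetical helper: formats v in scientific notation with enough digits
// that parsing the text back yields the same double.
std::string to_full_precision(double v) {
    std::ostringstream out;
    // In scientific format, precision counts digits after the decimal point,
    // so max_digits10 significant digits means a precision of max_digits10 - 1.
    out << std::scientific
        << std::setprecision(std::numeric_limits<double>::max_digits10 - 1)
        << v;
    return out.str();
}
```

The trade-off is that short decimal values are printed with trailing digits that reflect the binary approximation, since every value is written with the same fixed number of significant digits.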

Why does auto deduce this variable as double and not float? [duplicate]

Question: In the snippet below, auto deduces the variable as double, but I want float.

    auto one = 3.5;

Does it always use double for
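In C++, an unsuffixed floating-point literal such as `3.5` has type `double`, so `auto` deduces `double`; a literal suffix changes the deduced type. The short sketch below illustrates this with compile-time checks. The variable names are made up for the example.

```cpp
#include <type_traits>

auto one_as_double = 3.5;        // unsuffixed floating literal: double
auto one_as_float = 3.5f;        // 'f' suffix: float
auto one_as_long_double = 3.5L;  // 'L' suffix: long double

// Compile-time checks that the deduced types are what the comments claim.
static_assert(std::is_same<decltype(one_as_double), double>::value, "expected double");
static_assert(std::is_same<decltype(one_as_float), float>::value, "expected float");
static_assert(std::is_same<decltype(one_as_long_double), long double>::value, "expected long double");
```

Writing `float one = 3.5;` instead of using `auto` also yields a `float`, at the cost of an implicit conversion from the `double` literal.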

Swift: Casting a FloatingPoint conforming value to Double

Question: I'm writing an extension to the FloatingPoint protocol. I want to cast the value to a Double in any way possible.

    extension FloatingPoint {
        var toDouble: Double {
            return Double(exactly: self) ?? 0 // compile error
        }
    }

I'd prefer not to say what I'm trying to achieve, so we can focus just on the problem above. I'd rather hear that it's impossible to do it this way than receive a valid workaround for the bigger problem I'm trying to solve. I tried to use different Double constructors but maybe I wasn

Convert Sign-Bit, Exponent and Mantissa to float?

Question: I have the sign bit, exponent, and mantissa (as shown in the code below). I'm trying to take these values and turn them into a float. The goal is to get 59.98 (it will read as 59.9799995).

    uint32_t FullBinaryValue = (Converted[0] << 24) | (Converted[1] << 16) | (Converted[2] << 8) | (Converted[3]);
    unsigned int sign_bit = (FullBinaryValue & 0x80000000);
    unsigned int exponent = (FullBinaryValue & 0x7F800000) >> 23;
    unsigned int mantissa = (FullBinaryValue & 0x7FFFFF);

What I originally
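Two ways of turning the three fields back into a value can be sketched in C++, assuming `float` is an IEEE 754 binary32 type. Both function names are hypothetical, and the sketch expects the sign as 0 or 1, so a sign extracted with the unshifted mask `0x80000000` as in the question would need `>> 31` first. This is an illustration of the bit layout, not necessarily the fix proposed in the original thread.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reassemble the three fields into the 32-bit pattern and reinterpret it.
// sign is 0 or 1, exponent is the 8-bit biased exponent,
// mantissa is the 23-bit fraction field.
float float_from_fields(std::uint32_t sign, std::uint32_t exponent, std::uint32_t mantissa) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    const std::uint32_t bits =
        (sign << 31) | ((exponent & 0xFFu) << 23) | (mantissa & 0x7FFFFFu);
    float result;
    std::memcpy(&result, &bits, sizeof result);  // copy the bit pattern into a float
    return result;
}

// Same value computed arithmetically, valid for normal numbers only
// (exponent field 1..254): (-1)^sign * (1 + mantissa / 2^23) * 2^(exponent - 127).
float float_from_fields_arithmetic(std::uint32_t sign, std::uint32_t exponent, std::uint32_t mantissa) {
    const float significand = 1.0f + static_cast<float>(mantissa) / 8388608.0f;  // 2^23
    const float magnitude = std::ldexp(significand, static_cast<int>(exponent) - 127);
    return sign ? -magnitude : magnitude;
}
```

For the target value in the question, the float nearest to 59.98 has sign 0, biased exponent 132, and mantissa field `0x6FEB85`, so `float_from_fields(0, 132, 0x6FEB85)` compares equal to `59.98f`. The `memcpy` version also covers zeros, subnormals, infinities, and NaNs, which the arithmetic version does not handle.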