floating-point

Instruction FYL2XP1

自闭症网瘾萝莉.ら submitted on 2021-01-23 06:33:47
Question: I'm wondering why the FYL2XP1 instruction on the x86 architecture computes exactly the mathematical formula y · log2(x + 1). What's special about this formula? Answer 1: The y operand is usually a compile-time constant; for the moment, forget about the x + 1. Since log_b(x) = log_b(2) * log_2(x), the instruction allows computing the logarithm of x + 1 in any base. Note that log_b(2) is a constant, since it is seldom necessary to compute the logarithm with a degree of freedom in the base. FYL2XP1 and
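The identity quoted in the answer can be sketched in portable C++. This is only an illustration, not the instruction itself: `log_base_1p` and `log_base_1p_naive` are hypothetical helper names, and `std::log1p` stands in for the "log of x + 1 without first forming x + 1" behaviour that makes this operand shape useful when x is close to zero.

```cpp
#include <cmath>

// Hypothetical helper: log_b(1 + x), written in the same shape as
// FYL2XP1, i.e. y * log2(x + 1) with the constant y = log_b(2).
double log_base_1p(double x, double base) {
    const double y = 1.0 / std::log2(base);            // log_b(2) == 1 / log2(b)
    const double log2_1p = std::log1p(x) / std::log(2.0); // log2(1 + x), 1 + x never formed
    return y * log2_1p;
}

// Naive variant that forms 1 + x first. For tiny |x| the sum rounds to
// exactly 1.0 and every digit of x is lost before the logarithm is taken.
double log_base_1p_naive(double x, double base) {
    return std::log2(1.0 + x) / std::log2(base);
}
```

Calling both helpers with a very small x, for example `1e-20`, shows the difference: the naive version returns zero, while the `log1p`-based version keeps the leading digits of the result.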

Is floating point math broken?

折月煮酒 submitted on 2021-01-20 13:53:52
Question: Consider the following code: 0.1 + 0.2 == 0.3 -> false 0.1 + 0.2 -> 0.30000000000000004 Why do these inaccuracies happen? Answer 1: Binary floating-point math works like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be represented exactly. For 0.1 in the standard
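The "whole number times a power of two" description can be checked directly. The sketch below assumes `double` is IEEE 754 binary64 (53-bit significand); `decompose`, `exactly_equal`, and `nearly_equal` are hypothetical helper names, and the relative tolerance of `1e-9` is an arbitrary choice for illustration, not a recommended universal value.

```cpp
#include <cmath>

// A finite, normal double written as significand * 2^exponent,
// where significand is a whole number (assumes a 53-bit significand).
struct Decomposed {
    long long significand;
    int exponent;
};

Decomposed decompose(double value) {
    int e = 0;
    const double m = std::frexp(value, &e);   // value == m * 2^e, 0.5 <= |m| < 1
    // Scaling m by 2^53 yields an integer for normal doubles.
    return { static_cast<long long>(std::ldexp(m, 53)), e - 53 };
}

// Exact comparison: this is what `0.1 + 0.2 == 0.3` performs.
bool exactly_equal(double a, double b) {
    return a == b;
}

// Hypothetical tolerance-based comparison; rel_tol is an assumption.
bool nearly_equal(double a, double b, double rel_tol = 1e-9) {
    const double scale = std::fmax(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= rel_tol * scale;
}
```

Under that assumption, `decompose(0.1)` reports the whole number 7205759403792794 with exponent -56, which is the nearest representable value to 1/10 rather than 1/10 itself. `exactly_equal(0.1 + 0.2, 0.3)` is false, while `nearly_equal(0.1 + 0.2, 0.3)` is true.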

Is floating point math broken?

邮差的信 submitted on 2021-01-20 13:53:05
Question: Consider the following code: 0.1 + 0.2 == 0.3 -> false 0.1 + 0.2 -> 0.30000000000000004 Why do these inaccuracies happen? Answer 1: Binary floating-point math works like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be represented exactly. For 0.1 in the standard

atan2f gives different results with m32 flag

|▌冷眼眸甩不掉的悲伤 submitted on 2021-01-20 07:12:55
Question: I'm porting some code from 32-bit to 64-bit and making sure the answers are the same. In doing so, I noticed that atan2f was giving different results between the two. I created this minimal repro: #include <stdio.h> #include <math.h> void testAtan2fIssue(float A, float B) { float atan2fResult = atan2f(A, B); printf("atan2f: %.15f\n", atan2fResult); float atan2Result = atan2(A, B); printf("atan2: %.15f\n", atan2Result); } int main() { float A = 16.323556900024414; float B = -5.843180656433105;
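The excerpt breaks off before any explanation, so the following sketch does not attempt to reproduce the 32-bit versus 64-bit discrepancy. It only illustrates a general point relevant to the repro: a `float` result carries roughly 7 to 9 significant decimal digits, so most of the digits printed by `%.15f` lie beyond what single precision can distinguish. `Atan2Pair` and `atan2_both` are hypothetical names.

```cpp
#include <cmath>

// Result of evaluating atan2 once in single and once in double precision.
struct Atan2Pair {
    float single_precision;
    double double_precision;
};

Atan2Pair atan2_both(float a, float b) {
    Atan2Pair result;
    // float overload of std::atan2 (the C++ counterpart of atan2f)
    result.single_precision = std::atan2(a, b);
    // double overload, evaluated on the same inputs widened to double
    result.double_precision =
        std::atan2(static_cast<double>(a), static_cast<double>(b));
    return result;
}
```

With the inputs from the question, the two fields agree to about single-precision accuracy; any difference beyond that is rounding in the last bit of the `float`, which is the level at which different math-library implementations can legitimately disagree.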

Full precision display of floating point numbers in C++?

萝らか妹 submitted on 2021-01-20 04:43:12
Question: I have read several topics about displaying floating-point numbers in C++ and couldn't find a satisfying answer. My question is: how can I display all the significant digits of a floating-point number in C++ in scientific format (mantissa/exponent)? The problem is that not all numbers have the same number of significant digits in base 10. For example, a double has 15 to 17 significant decimal digits of precision, but std::numeric_limits<double>::digits10 returns 15 and
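One common approach, sketched here under the assumption that "all significant digits" means "enough digits to read the same `double` back", is to print with `std::numeric_limits<double>::max_digits10` significant digits. `to_full_precision` is a hypothetical helper name. In `std::scientific` format the stream precision counts digits after the decimal point, hence the `- 1`.

```cpp
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Formats a double in scientific notation with max_digits10 significant
// digits, which is enough for the text to convert back to the same value.
std::string to_full_precision(double value) {
    std::ostringstream out;
    out << std::scientific
        // one digit before the point + (max_digits10 - 1) after it
        << std::setprecision(std::numeric_limits<double>::max_digits10 - 1)
        << value;
    return out.str();
}
```

Parsing the returned string with `std::stod` should give back the original value. The trade-off is that short decimals such as 0.1 are printed with trailing digits that expose the binary approximation.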

Why does auto deduce this variable as double and not float? [duplicate]

时光怂恿深爱的人放手 submitted on 2021-01-19 14:24:14
Question: In the snippet below, auto deduces the variable to double, but I want float. auto one = 3.5; Does it always use double for
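The deduction follows the type of the literal: an unsuffixed floating-point literal in C++ has type `double`, and the `f` suffix makes it a `float`. A minimal sketch, with `one_double` and `one_float` as illustrative variable names:

```cpp
#include <type_traits>

auto one_double = 3.5;   // unsuffixed literal: type double
auto one_float = 3.5f;   // 'f' suffix: type float

// Compile-time checks that auto deduced the literal's own type.
static_assert(std::is_same<decltype(one_double), double>::value,
              "an unsuffixed floating-point literal is a double");
static_assert(std::is_same<decltype(one_float), float>::value,
              "the f suffix yields a float");
```

Writing `float one = 3.5;` instead of using `auto` also works, at the cost of an implicit conversion from `double` to `float`.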

Why does auto deduce this variable as double and not float? [duplicate]

两盒软妹~` submitted on 2021-01-19 14:23:34
Question: In the snippet below, auto deduces the variable to double, but I want float. auto one = 3.5; Does it always use double for

Swift: Casting a FloatingPoint conforming value to Double

懵懂的女人 submitted on 2021-01-05 06:39:14
Question: I'm writing an extension to the FloatingPoint protocol. I want to cast the value to a Double in any way possible. extension FloatingPoint { var toDouble: Double { return Double(exactly: self) ?? 0 //compile error } } I'd prefer not to say what I'm trying to achieve so we can focus on just the problem above. I'd rather hear that it's impossible to do it this way than receive a valid workaround for the bigger problem I'm trying to solve. I tried to use different Double constructors but maybe I wasn

Swift: Casting a FloatingPoint conforming value to Double

不问归期 submitted on 2021-01-05 06:37:47
Question: I'm writing an extension to the FloatingPoint protocol. I want to cast the value to a Double in any way possible. extension FloatingPoint { var toDouble: Double { return Double(exactly: self) ?? 0 //compile error } } I'd prefer not to say what I'm trying to achieve so we can focus on just the problem above. I'd rather hear that it's impossible to do it this way than receive a valid workaround for the bigger problem I'm trying to solve. I tried to use different Double constructors but maybe I wasn

Convert Sign-Bit, Exponent and Mantissa to float?

孤街浪徒 submitted on 2021-01-04 05:55:17
Question: I have the sign bit, exponent, and mantissa (as shown in the code below). I'm trying to take these values and turn them into a float. The goal is to get 59.98 (it will read as 59.9799995). uint32_t FullBinaryValue = (Converted[0] << 24) | (Converted[1] << 16) | (Converted[2] << 8) | (Converted[3]); unsigned int sign_bit = (FullBinaryValue & 0x80000000); unsigned int exponent = (FullBinaryValue & 0x7F800000) >> 23; unsigned int mantissa = (FullBinaryValue & 0x7FFFFF); What I originally
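The question is cut off before the author's attempt, so the following is only one possible sketch, assuming `float` is a 32-bit IEEE 754 binary32 value. It reassembles the three fields into the original bit pattern and copies the bits into a `float` with `std::memcpy`. `float_from_fields` is a hypothetical helper; it expects the sign as 0 or 1, so the unshifted `sign_bit` from the snippet above would first need `>> 31`.

```cpp
#include <cstdint>
#include <cstring>

// Rebuilds a float from its IEEE 754 binary32 fields.
// sign: 0 or 1; exponent: biased 8-bit field; mantissa: 23-bit fraction field.
float float_from_fields(std::uint32_t sign,
                        std::uint32_t exponent,
                        std::uint32_t mantissa) {
    const std::uint32_t bits = ((sign & 0x1u) << 31)
                             | ((exponent & 0xFFu) << 23)
                             | (mantissa & 0x7FFFFFu);
    float result;
    static_assert(sizeof(result) == sizeof(bits), "float must be 32 bits wide");
    // memcpy reinterprets the bit pattern without violating aliasing rules.
    std::memcpy(&result, &bits, sizeof(result));
    return result;
}
```

For normal numbers this is equivalent to evaluating (-1)^sign × (1 + mantissa / 2^23) × 2^(exponent - 127). Copying the bits also handles zeros, subnormals, infinities, and NaNs without special cases.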