ieee-754

Causing underflow in ieee-754 floating point format using subtraction

风格不统一 · Submitted on 2019-12-11 09:48:00

Question: This seems basic, but I am having a lot of trouble answering the following question: give two numbers X and Y, represented in the IEEE 754 format, such that computing X - Y will result in underflow. To my understanding every operation can potentially result in underflow, but for the life of me I can't find an example for subtraction. Please help! Thanks.

Answer 1: When default exception handling is in effect, a subtraction that produces a tiny (in the subnormal interval) non-zero result conceptually …
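The truncated answer points at results in the subnormal range. A minimal sketch (mine, not from the thread) of two normal doubles whose exact difference is subnormal, i.e. underflows:

```python
import sys

min_normal = sys.float_info.min      # smallest positive normal double, 2**-1022

x = 2.0 * min_normal                 # 2**-1021, a normal number
y = 1.5 * min_normal                 # 1.5 * 2**-1022, also normal
d = x - y                            # exactly 2**-1023: non-zero but subnormal

print(0.0 < d < min_normal)          # True: the result underflowed the normal range
```

The subtraction is exact here; underflow occurs simply because the exact result lies below the smallest normal number.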

IEEE 754 Bit manipulation Rounding Error

不想你离开。 · Submitted on 2019-12-11 07:35:47

Question: Without using casts or library functionality, I must convert an integer to a float with bit manipulation. Below is the code I am currently working on. It is based on code that I found in Cast Integer to Float using Bit Manipulation breaks on some integers in C. The problem I have run into involves the rounding rules of IEEE 754: my code rounds toward 0, but it should round to nearest with ties to even. What changes do I need to make? unsigned inttofloat(int x) { int …
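The missing piece is the ties-to-even test on the dropped bits. A sketch of the conversion in Python rather than the asker's C (the helper name and structure are mine); the same guard/sticky logic is what the C version needs:

```python
def int_to_float32_bits(x: int) -> int:
    """Bits of the float32 nearest to non-negative int x, ties to even (a sketch)."""
    if x == 0:
        return 0
    msb = x.bit_length() - 1        # position of the leading 1 (the implicit bit)
    exponent = msb + 127            # biased float32 exponent
    if msb <= 23:                   # everything fits: no rounding needed
        return (exponent << 23) | ((x << (23 - msb)) & 0x7FFFFF)
    shift = msb - 23                # bits we must drop
    mantissa = x >> shift
    dropped = x & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    # Round to nearest; on an exact tie, round only if the kept mantissa is odd.
    if dropped > half or (dropped == half and (mantissa & 1)):
        mantissa += 1
        if mantissa >> 24:          # rounding carried out of the mantissa
            mantissa >>= 1
            exponent += 1
    return (exponent << 23) | (mantissa & 0x7FFFFF)

# Cross-check against the machine conversion:
import struct
for v in (1, 7, 100, 16777217, 16777219, 33554431, 2**31 - 1):
    assert int_to_float32_bits(v) == struct.unpack('<I', struct.pack('<f', v))[0]
```

Truncation (round toward 0) would simply discard `dropped`; the two extra comparisons above are the entire difference.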

Working with different IEEE floating-point rounding modes in C++

前提是你 · Submitted on 2019-12-11 07:31:15

Question: Woe is me, I have to ensure the same floating-point results on a GPU and on the CPU. OK, I understand IEEE has taken care of me and provided a nice standard to adhere to, with several rounding options; and the CUDA side is sorted out (there are intrinsics for the different rounding modes), so that's just motivation. But in host-side C++ code, how do I perform floating-point arithmetic in a specific rounding mode (and I mean in a specific statement, not throughout my translation unit)? Are …
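In host-side C++ the usual tool is std::fesetround from <cfenv>, saved and restored around the statement in question (with FENV_ACCESS enabled). A rough demonstration of that same libm call via ctypes; the FE_* values below are glibc/x86 assumptions and differ on other platforms:

```python
import ctypes
import ctypes.util

# Hypothetical sketch: reach C99 fesetround() through libm from Python.
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")
FE_TONEAREST, FE_DOWNWARD, FE_UPWARD = 0x000, 0x400, 0x800  # glibc/x86 values

one, three = 1.0, 3.0            # variables, so the division happens at runtime

libm.fesetround(FE_DOWNWARD)
down = one / three               # rounded toward minus infinity
libm.fesetround(FE_UPWARD)
up = one / three                 # rounded toward plus infinity
libm.fesetround(FE_TONEAREST)    # always restore the default mode

print(down <= up)                # True; they differ by one ulp when the modes took effect
```

The save-and-restore discipline shown here is exactly what a scoped RAII guard around std::fesetround would do in C++.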

Are all integers with exponent over 52 even in 64-bit floating point?

筅森魡賤 · Submitted on 2019-12-11 06:15:12

Question: Am I correct to conclude that all integers with an exponent over 52 in the 64-bit floating-point format are even? For example, if the exponent is 53 and the mantissa is 0000000000000000000000000000000000000000000000000001, then the number is 100000000000000000000000000000000000000000000000000010, which ends in 10. If the exponent is 54, the number ends with 100. The more we increase the exponent, the more integers cannot be represented; for an exponent of 54 it's impossible to represent one …
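The claim can be checked directly: the spacing (ulp) between adjacent doubles is 2 at exponent 53 and 4 at exponent 54, so only even integers (and only multiples of 4, respectively) are representable there:

```python
big = 2.0 ** 53

print(big + 1 == big)                # True: 2**53 + 1 is not representable
print((big + 2) - big)               # 2.0: the ulp at exponent 53 is 2,
                                     # so every representable integer here is even
print((2.0 ** 54 + 4) - 2.0 ** 54)   # 4.0: at exponent 54 integers end in ...100
```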

Meaning of Precision Vs. Range of Double Types

陌路散爱 · Submitted on 2019-12-11 06:13:05

Question: To begin with, allow me to confess that I'm an experienced programmer with over 10 years of programming experience. However, the question I'm asking here is one that has bugged me ever since I first picked up a book on C about a decade back. Below is an excerpt from a book on Python, explaining the Python floating type: Floating-point numbers are represented using the native double-precision (64-bit) representation of floating-point numbers on the machine. Normally this is IEEE …
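The distinction the question is after is visible in Python's own float metadata: range is set by the 11-bit exponent, precision by the 53-bit significand, and the two are independent limits:

```python
import sys

# Range: bounded by the 11-bit exponent.
print(sys.float_info.max)       # about 1.8e308
print(sys.float_info.min)       # about 2.2e-308 (smallest positive normal)

# Precision: bounded by the 53-bit significand, regardless of magnitude.
print(sys.float_info.mant_dig)  # 53 bits of significand
print(sys.float_info.dig)       # 15 decimal digits always survive a round trip
print(0.1 + 0.2)                # 0.30000000000000004: a precision limit, not a range limit
```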

32-bit IEEE 754 single precision floating point to hexadecimal

余生颓废 · Submitted on 2019-12-11 05:15:05

Question: I have learnt how to convert numbers to floating point (on top of binary, octal and hexadecimal). However, while looking through a worksheet I have been given, I encountered the following question: Using 32-bit IEEE 754 single-precision floating point, show the representation of -12.13 in hexadecimal. I have tried looking at the resources I have and still can't figure out how to answer it. The answer given is 0xc142147b. Edit: …
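The worksheet answer can be verified mechanically: pack the value as a big-endian IEEE 754 single and read back the raw bytes:

```python
import struct

raw = struct.pack('>f', -12.13)        # big-endian IEEE 754 single precision
print(raw.hex())                       # c142147b: sign 1, exponent 130, mantissa 0x42147b

# The stored value is the nearest float32, slightly off from -12.13
# because 0.13 is not exactly representable in binary:
print(struct.unpack('>f', raw)[0])
```

Doing it by hand gives the same pieces: 12.13 = 1.10000100001010001111011...₂ × 2³, so the biased exponent is 3 + 127 = 130 (10000010₂), and assembling sign 1, exponent 10000010, mantissa 10000100001010001111011 yields 0xC142147B.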

Result of the sum of random-ordered IEEE 754 double precision floats

一世执手 · Submitted on 2019-12-11 04:29:02

Question: Here is pseudocode for my problem. I have an array of IEEE 754 double-precision positive numbers. The array can come in a random order, but the numbers are always the same, just scrambled in their positions. These numbers can also vary over a very wide range within the valid IEEE range of the double representation. Once I have the list, I initialize a variable: double sum_result = 0.0; and I accumulate the sum in sum_result in a loop over the whole array. At each step I do: sum_result += my_double …
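The order-dependence behind the question is easy to reproduce; the values below are mine, chosen so that a large term absorbs the small ones before cancelling (the question uses positive numbers only, but mixed signs make the effect starkest):

```python
import math

nums = [1e16, 1.0, -1e16, 1.0]   # same numbers, two traversal orders

a = 0.0
for v in nums:
    a += v                       # 1e16 absorbs each 1.0 before the cancellation

b = 0.0
for v in sorted(nums):
    b += v                       # ascending order loses the 1.0s differently

print(a, b, math.fsum(nums))     # 1.0 0.0 2.0: only fsum recovers the exact sum
```

For a result independent of element order, an exact accumulator such as math.fsum (or compensated Kahan summation) is the standard remedy.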

Why are the bit strings representing 1.0 and 2.0 so different?

好久不见. · Submitted on 2019-12-11 03:21:57

Question: I recently started using Julia and came upon the bits function, which returns the bit-string representation of its numeric argument. For example:

    julia> bits(1.0)
    "0011111111110000000000000000000000000000000000000000000000000000"

However, while playing with this function, I was surprised to discover that bits returns very different bit strings for 1.0 and 2.0:

    julia> bits(1.0)
    "0011111111110000000000000000000000000000000000000000000000000000"
    julia> bits(2.0)
    …
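The surprise dissolves once the fields are separated: 1.0 and 2.0 differ only in the biased exponent (1023 vs 1024), but that flips several adjacent bits. A small Python analogue of Julia's bits():

```python
import struct

def bits(x: float) -> str:
    """Raw IEEE 754 double bit pattern of x, as a 64-character string."""
    (n,) = struct.unpack('>Q', struct.pack('>d', x))
    return format(n, '064b')

# sign | 11-bit biased exponent | 52-bit mantissa
print(bits(1.0))   # 0 01111111111 000...0  (exponent 1023, mantissa 0)
print(bits(2.0))   # 0 10000000000 000...0  (exponent 1024, mantissa 0)
```

Both numbers have an all-zero mantissa; the strings look unrelated only because 1023 → 1024 carries through every exponent bit.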

Can all 32 bit ints be exactly represented as a double? [duplicate]

我怕爱的太早我们不能终老 · Submitted on 2019-12-11 02:56:07

Question: This question already has answers here (closed 7 years ago). Possible duplicate: Which is the first integer that an IEEE 754 float is incapable of representing exactly? This is a basic question; my feeling is that the answer is yes (int = 32 bits, double = 53-bit mantissa + sign bit). Basically, can these asserts fire?

    int x = get_random_int();
    double dx = x;
    int x1 = (int) dx;
    assert(x1 == x);
    if (INT_MAX - 10 > x) {
        dx += 10;
        int x2 = (int) dx;
        assert(x + 10 == x2);
    }

Obviously stuff involving complicated …
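The asker's feeling is right: a double carries 53 significand bits, while a 32-bit int needs at most 31 magnitude bits plus sign, so every value round-trips exactly. A quick check at the extremes:

```python
# Every 32-bit int fits in a double's 53-bit significand, so the
# int -> double -> int round trip is exact across the whole range.
for x in (-2**31, -2**31 + 1, -1, 0, 1, 2**31 - 2, 2**31 - 1):
    assert int(float(x)) == x

# Exactness first breaks just past 2**53:
print(float(2**53 + 1) == float(2**53))   # True: 2**53 + 1 collapses to 2**53
```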

In IEEE 754, why does adding negative zero result in a no-op but adding positive zero does not?

跟風遠走 · Submitted on 2019-12-11 00:00:39

Question: I'm toying with an algorithm in Rust (though the language doesn't really matter for my question). Consider the code:

    #[no_mangle]
    pub fn test(x: f32) -> f32 {
        let m = 0.;
        x + m
    }

    fn main() {
        test(2.);
    }

It produces the following LLVM IR and corresponding x86_64 asm (optimizations enabled):

    ;; LLVM IR
    define float @test(float %x) unnamed_addr #0 {
    start:
        %0 = fadd float %x, 0.000000e+00
        ret float %0
    }

    ;; x86_64
    ; test:
        xorps xmm1, xmm1
        addss xmm0, xmm1
        ret

If I change let m = 0.; to let m = …
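The reason the compiler can fold away x + (-0.0) but not x + 0.0 is the signed-zero rule: under round-to-nearest, x + (-0.0) equals x for every x, while (-0.0) + 0.0 is +0.0, so adding +0.0 is not an identity. This is observable from any language with IEEE semantics:

```python
import math

x = -0.0
print(math.copysign(1.0, x + 0.0))     # 1.0: adding +0.0 flipped -0.0 to +0.0
print(math.copysign(1.0, x + (-0.0)))  # -1.0: adding -0.0 left x untouched

# Hence x + (-0.0) is an identity for all finite x (safe to optimize away),
# while x + 0.0 is not, which is why the fadd survives in the emitted asm.
```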