Question
High-level programming languages often provide a function to determine the absolute value of a floating-point value. For example, the C standard library provides the fabs(double) function.
How is this library function actually implemented for x86 targets? What would actually be happening "under the hood" when I call a high-level function like this?
Is it an expensive operation (a combination of multiplication and taking the square root)? Or is the result found just by removing a negative sign in memory?
Answer 1:
In general, computing the absolute value of a floating-point quantity is an extremely cheap and fast operation.
In practically all cases, you can simply treat the fabs function from the standard library as a black box, sprinkling it in your algorithms where necessary, without any need to worry about how it will affect execution speed.
If you want to understand why this is such a cheap operation, then you need to know a little bit about how floating-point values are represented. Although the C and C++ language standards do not actually mandate it, most implementations follow the IEEE-754 standard. In that standard, each floating-point value's representation contains a bit known as the sign bit, and this marks whether the value is positive or negative. For example, consider a double, which is a 64-bit double-precision floating-point value:
[Figure: bit layout of an IEEE-754 double-precision value: 1 sign bit, 11 exponent bits, 52 significand bits. Image courtesy of Codekaizen, via Wikipedia, licensed under CC BY-SA.]
You can see the sign bit over there on the far left, in light blue. This holds for all precisions of floating-point values in IEEE-754. Taking the absolute value therefore basically just amounts to clearing a single bit in the value's representation. In particular, you just need to mask off the sign bit (with a bitwise AND), forcing it to 0 and thereby marking the value as positive.
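To make that concrete, here is a minimal C sketch of the bit-level operation, assuming IEEE-754 and a 64-bit double (the function name my_fabs is made up for illustration; the real library routine achieves the same effect, usually via a single instruction):

    #include <stdint.h>
    #include <string.h>

    /* Illustrative sketch: clear bit 63 (the sign bit) of an IEEE-754 double. */
    double my_fabs(double x)
    {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);   /* reinterpret the 64 bits safely */
        bits &= 0x7FFFFFFFFFFFFFFFull;    /* force the sign bit to 0 */
        memcpy(&x, &bits, sizeof x);
        return x;
    }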
Assuming that your target architecture has hardware support for floating-point operations, this is generally a single one-cycle instruction, basically as fast as can possibly be. An optimizing compiler will inline a call to the fabs library function, emitting that single hardware instruction in its place.
If your target architecture doesn't have hardware support for floating-point (which is pretty rare nowadays), then there will be a library that emulates these semantics in software, thus providing floating-point support. Floating-point emulation is typically slow, but finding the absolute value is one of the fastest things you can do, since it is literally just manipulating a bit. You'll pay the overhead of a function call to fabs, but at worst, the implementation of that function will just read the value's bytes from memory, mask off the sign bit, and store the result back to memory.
Looking specifically at x86, which does implement IEEE-754 in hardware, there are two main ways that your C compiler will transform a call to fabs into machine code.
In 32-bit builds, where the legacy x87 FPU is used for floating-point operations, it will emit an fabs instruction. (Yep, same name as the C function.) This clears the sign bit of the floating-point value at the top of the x87 register stack. On AMD processors and the Intel Pentium 4, fabs is a 1-cycle instruction with a 2-cycle latency. On AMD Ryzen and all other Intel processors, it is a 1-cycle instruction with a 1-cycle latency.
In 32-bit builds that can assume SSE support, and in all 64-bit builds (where SSE2 is always supported), the compiler will emit an ANDPS instruction* that does exactly what I described above: it bitwise-ANDs the floating-point value with a constant mask, masking out the sign bit. Notice that SSE doesn't have a dedicated absolute-value instruction the way x87 does, but it doesn't need one, because the general-purpose bitwise instructions serve the job just fine. The execution characteristics (throughput, latency, etc.) vary a bit more widely from one processor microarchitecture to another, but the instruction generally has a throughput of 1–3 cycles, with a similar latency. If you like, you can look it up in Agner Fog's instruction tables for the processors of interest.
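As a quick illustration (an assumption about typical toolchains, not a guarantee; exact output varies by compiler and flags), a plain C wrapper like the following, compiled at -O2 on x86-64 with gcc or clang, usually becomes a single ANDPD against a constant mask in memory:

    #include <math.h>

    /* abs_double is just an illustrative wrapper name. */
    double abs_double(double x)
    {
        return fabs(x);
    }

    /* Typical x86-64 output (label names vary):
           andpd  xmm0, XMMWORD PTR .LC0[rip]   ; .LC0 holds the 0x7FFFFFFFFFFFFFFF mask
           ret
    */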
If you're really interested in digging into this, you might see this answer (hat tip to Peter Cordes), which explores a variety of different ways to implement an absolute-value function using SSE instructions, comparing their performance and discussing how you can get a compiler to generate the appropriate code. As you can see, since you're just manipulating bits, there are a variety of possible solutions! In practice, though, the current crop of compilers does exactly as I've described for the C library function fabs, which makes sense, because this is the best general-purpose solution.
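For reference, one common way to spell this out directly with SSE2 intrinsics, in the spirit of the linked answer, is to AND-NOT against -0.0, which has only the sign bit set. This is a sketch under that assumption (sse2_fabs is an illustrative name), not the library's actual source:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* ANDNPD computes (~a) & b, so AND-NOT with -0.0 clears exactly the sign bit. */
    double sse2_fabs(double x)
    {
        __m128d v = _mm_set_sd(x);                /* low lane = x */
        v = _mm_andnot_pd(_mm_set_sd(-0.0), v);   /* clear the sign bit */
        return _mm_cvtsd_f64(v);                  /* extract the low lane */
    }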
__
* Technically, this might also be ANDPD, where the D means "double" (and the S means "single"), but ANDPD requires SSE2 support. SSE supports single-precision floating-point operations and was available all the way back to the Pentium III; SSE2 is required for double-precision floating-point operations and was introduced with the Pentium 4. SSE2 is always supported on x86-64 CPUs. Whether ANDPS or ANDPD is used is a decision made by the compiler's optimizer; sometimes you will see ANDPS used on a double-precision floating-point value, since that just requires writing the mask the right way.
Also, on CPUs that support AVX instructions, you'll generally see a VEX prefix on the ANDPS/ANDPD instruction, so that it becomes VANDPS/VANDPD. Details on how this works and what its purpose is can be found elsewhere online; suffice it to say that mixing VEX and non-VEX instructions can result in a performance penalty, so compilers try to avoid it. Again, though, both of these versions have the same effect and virtually identical execution speeds.
Oh, and because SSE is a SIMD instruction set, it is possible to compute the absolute value of multiple floating-point values at once. This, as you might imagine, is especially efficient. Compilers with auto-vectorization capabilities will generate code like this where possible (a C-level sketch follows the listing below). Example (the mask can either be generated on the fly, as shown here, or loaded as a constant):

    pcmpeqd xmm1, xmm1   ; generate the mask (all 1s) in a temporary register
    psrld   xmm1, 1      ; shift right: 1s in all but the left-most bit of each packed dword
    andps   xmm0, xmm1   ; mask off the sign bit in each packed floating-point value
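For a sense of what kind of source code triggers that, here is a minimal C sketch (the function name fabs_array is made up for illustration); with optimization and auto-vectorization enabled, compilers typically turn this loop into packed ANDPS/VANDPS operating on several floats per iteration:

    #include <math.h>
    #include <stddef.h>

    /* Illustrative only: absolute value of every element of an array.
       Auto-vectorizers commonly lower the loop body to packed and-masking. */
    void fabs_array(float *dst, const float *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = fabsf(src[i]);
    }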
Source: https://stackoverflow.com/questions/44630015/how-would-fabsdouble-be-implemented-on-x86-is-it-an-expensive-operation