I know that x87 has higher internal precision, which is probably the biggest difference that people see between it and SSE operations. But I have to wonder, is there any other
x87 FPU instructions are smaller than SSE instructions, so they are ideal for size-constrained demoscene stuff.
For hand-written asm, x87 has some instructions that don't exist in the SSE instruction set.
Off the top of my head, it's all trigonometric stuff like fsin, fcos, fsincos, fptan and fpatan (which takes two operands, like atan2), plus some exponential/logarithm stuff (f2xm1, fyl2x, fyl2xp1).
With gcc -O3 -ffast-math -mfpmath=387, GCC9 will still actually inline sin(x) as an fsin instruction, regardless of what the implementation in libm would have used. (https://godbolt.org/z/Euc5gp).
MSVC calls __libm_sse2_sin_precise when compiling for 32-bit x86.
If your code spends most of its time doing trigonometry, you may see a slight performance gain or loss if you use x87, depending on whether your standard math-library implementation using SSE1/SSE2 is faster or slower than the slow microcode for fsin on whatever CPU you're using.
CPU vendors don't put a lot of effort into optimizing the microcode for x87 instructions in the newest generations of CPUs because it's generally considered obsolete and rarely used. (Look at uop counts and throughput for complex x87 instructions in Agner Fog's instruction tables in recent generations of CPUs: more cycles than in older CPUs). The newer the CPU, the more likely x87 will be slower than many SSE or AVX instructions to compute log, exp, pow, or trig functions.
Even when x87 is available, not all math libraries choose to use complex instructions like fsin for implementing functions like sin(), or especially exp/log where integer tricks for manipulating the log-based FP bit-patterns are useful.
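To illustrate the kind of integer trick meant here, the following is a minimal sketch, not taken from any particular math library. It assumes IEEE 754 binary32 floats, and the function name ilog2_float is made up for this example. It reads the exponent field straight out of the bit pattern to get floor(log2(x)) for positive, normal, finite inputs; a real log implementation would typically combine such a step with a polynomial on the mantissa.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: floor(log2(x)) for a positive, normal, finite float.
 * Assumes IEEE 754 binary32: 1 sign bit, 8 exponent bits (bias 127),
 * 23 mantissa bits. Zero, denormals, infinities and NaNs are not handled. */
int ilog2_float(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);           /* well-defined type pun */
    int biased = (int)((bits >> 23) & 0xFFu); /* extract exponent field */
    return biased - 127;                      /* remove the bias */
}
```

For example, ilog2_float(10.0f) gives 3, because 10 = 1.25 * 2^3. Only integer shifts and masks are involved, so no x87 instruction such as fyl2x is needed for this step.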
Some DSP algorithms use a lot of trig, but typically benefit a lot from auto-vectorization with SIMD math libraries.
However, for math code where you spend most of your time doing additions, multiplications, etc., SSE is usually faster.
Also related: "Intel Underestimates Error Bounds by 1.3 quintillion": the worst case for fsin (catastrophic cancellation for fsin inputs very near pi) is very bad. Software can do better but only with slow extended-precision techniques.
There is considerable legacy and small system compatibility with the x87: SSE is a relatively new processor feature. If your code is to run on an embedded microcontroller, there's a good chance it won't support SSE instructions.
Even systems which don't have an FPU installed will often provide 80x87 emulators which will make the code run transparently (more or less). I don't know of any SSE emulators; certainly one of my systems doesn't have any, so the newest Adobe Photoshop Elements versions refuse to run.
The 80x87 instructions have good parallel operation characteristics which have been thoroughly explored and analyzed since their introduction in 1982 or so. Various clones of the x86 might stall on SSE instructions.
Conversion between float and double is faster with x87 (usually free) than with SSE. With x87, you can load and store a float, double or long double to or from the register stack and it is converted to or from extended precision without extra cost. With SSE, additional instructions are required to do the type conversion if types are mixed, because the registers contain float or double values. These conversion instructions are fairly fast but do take extra time.
The real fix is to refrain from mixing float and double excessively, not to use x87, of course.
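As a small sketch of how such mixing sneaks in, consider the two hypothetical functions below (the names are made up for this example). The first multiplies a float by a double literal, so under C's usual arithmetic conversions the operand is widened to double and the result narrowed back to float; with SSE2 scalar code a compiler will typically emit conversion instructions such as cvtss2sd and cvtsd2ss around the multiply. The second uses a float literal and stays in single precision throughout.

```c
/* Mixed types: 0.1 is a double literal, so x is converted to double,
 * multiplied in double precision, then converted back to float. */
float scale_mixed(float x)
{
    return x * 0.1;
}

/* Uniform types: 0.1f is a float literal, so no conversion is needed. */
float scale_float(float x)
{
    return x * 0.1f;
}
```

The two versions can differ in the last bit of the result, since one rounds once from a double product and the other multiplies by an already rounded float constant, so which one is appropriate depends on the precision you actually need.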