I\'ve used x86 SIMD instructions (SSE1234) in the form of intrinsics quite a lot lately. What I found frustrating is that the SSE ISA has several simple instructions that ar
From an expert (obviously not me :P): http://www.agner.org/optimize/optimizing_assembly.pdf [13.2 Using vector instructions with other types of data than they are intended for (pages 118-119)]:
There is a penalty for using the wrong type of instructions on some processors. This is because the processor may have different data buses or different execution units for integer and floating point data. Moving data between the integer and floating point units can take one or more clock cycles depending on the processor, as listed in table 13.2.
Processor Bypass delay, clock cycles Intel Core 2 and earlier 1 Intel Nehalem 2 Intel Sandy Bridge and later 0-1 Intel Atom 0 AMD 2 VIA Nano 2-3 Table 13.2. Data bypass delays between integer and floating point execution units