After some thought, I came up with the following code for multiplying two quaternions using SSE:
#include
Never mind. If I compile the code with gcc -msse3 -O1 -S instead, I get the following:
.text
.align 4,0x90
.globl __Z13_mm_cross4_psU8__vectorfS_
__Z13_mm_cross4_psU8__vectorfS_:
LFB644:
movaps %xmm0, %xmm5
movaps %xmm1, %xmm3
movaps %xmm0, %xmm2
shufps $27, %xmm0, %xmm5
movaps %xmm5, %xmm4
shufps $17, %xmm1, %xmm3
shufps $187, %xmm1, %xmm1
mulps %xmm3, %xmm2
mulps %xmm1, %xmm4
mulps %xmm5, %xmm3
mulps %xmm1, %xmm0
hsubps %xmm4, %xmm2
haddps %xmm3, %xmm0
movaps %xmm2, %xmm1
shufps $177, %xmm0, %xmm1
shufps $228, %xmm2, %xmm0
addsubps %xmm1, %xmm0
shufps $156, %xmm0, %xmm0
ret
That's only 18 instructions now (not counting the ret), which is what I expected in the first place. Oops.
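When shuffling lanes around like this, it helps to have a plain scalar version to compare the SSE result against. The following is only a sketch of such a reference, not the code from the question: the name quat_mul_ref, the Quat alias, and the (w, x, y, z) component order are assumptions made for illustration, and the SSE routine may well use a different lane order.

```cpp
#include <array>

// Assumed layout for illustration only: q = {w, x, y, z}, with w the real part.
using Quat = std::array<float, 4>;

// Scalar Hamilton product a * b (16 multiplies, 12 additions/subtractions).
// Hypothetical helper, meant as a correctness reference for a SIMD version.
Quat quat_mul_ref(const Quat& a, const Quat& b) {
    return {
        a[0] * b[0] - a[1] * b[1] - a[2] * b[2] - a[3] * b[3],  // w
        a[0] * b[1] + a[1] * b[0] + a[2] * b[3] - a[3] * b[2],  // x
        a[0] * b[2] - a[1] * b[3] + a[2] * b[0] + a[3] * b[1],  // y
        a[0] * b[3] + a[1] * b[2] - a[2] * b[1] + a[3] * b[0]   // z
    };
}
```

Feeding the same inputs to both versions and comparing component by component (after accounting for any difference in lane order) should catch a wrong shuffle constant or a flipped sign quickly. The basis identities i * j = k and j * i = -k make convenient first test cases.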
You may be interested in Agner Fog's C++ vector class library. It provides Quaternion4f and Quaternion4d classes (including the * and *= operators, of course), implemented using the SSE2 and AVX instruction sets respectively. The library is open source, so you can dig into the code and find a good implementation example to build your function on.
Later on, you may consult the "optimizing subroutines in assembly language" manual and either write an optimized, pure assembly implementation of the function or, once you know some of the low-level tricks, rework the intrinsics approach in C.