I've been using Intel's SSE intrinsics for quite some time with good performance gains. Hence, I expected the AVX intrinsics to further speed up my programs. This, unfortunately, was not the case.
This is because VSQRTPS (the AVX instruction) takes exactly twice as many cycles as SQRTPS (the SSE instruction) on a Sandy Bridge processor. See Agner Fog's optimization guide: instruction tables, page 88.
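Instructions like square root and division don't benefit from AVX on Sandy Bridge: the 256-bit form handles twice as many elements but takes twice as many cycles, so the throughput per element is unchanged. Additions, multiplications, etc., on the other hand, do benefit.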
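To make the comparison concrete, here is a minimal sketch of the two kernels being contrasted. The function names `sqrt_sse` and `sqrt_avx` are made up for illustration, and the code assumes an x86-64 target, GCC or Clang (for the `target("avx")` attribute), and an array length that is a multiple of 8. It is not the original poster's code.

```c
#include <immintrin.h>
#include <stddef.h>

/* SSE version: each _mm_sqrt_ps (SQRTPS) processes 4 floats.
   Assumes n is a multiple of 4. */
void sqrt_sse(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);          /* unaligned load of 4 floats */
        _mm_storeu_ps(out + i, _mm_sqrt_ps(v));
    }
}

/* AVX version: each _mm256_sqrt_ps (VSQRTPS) processes 8 floats.
   Assumes n is a multiple of 8. The target attribute lets this one
   function use AVX without compiling the whole file with -mavx
   (a GCC/Clang extension, assumed here). */
__attribute__((target("avx")))
void sqrt_avx(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256 v = _mm256_loadu_ps(in + i);       /* unaligned load of 8 floats */
        _mm256_storeu_ps(out + i, _mm256_sqrt_ps(v));
    }
}
```

Timing both functions over the same large array should, if the instruction tables hold, show roughly equal run times on Sandy Bridge, since the AVX loop runs half as many iterations but each square root takes about twice as long. Swapping the square root for `_mm_add_ps` and `_mm256_add_ps` gives the contrasting case, where the wider instruction is expected to help.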