I\'ve been using Intel\'s SSE intrinsics for quite some time with good performance gains. Hence, I expected the AVX intrinsics to further speed-up my programs. This, unfortunate
Just for completeness. The Newton-Raphson (NR) implementation for operations like the division or the square root will only be beneficial if you have a limited number of those operations in your code. This is because if you used these alternative methods you will generate more pressure on other ports such as the multiplication and addition ports. That's basically the reason why x86 architectures have special hardware unit to handle these operation instead of the alternative software solutions (like NR). I quote from Intel 64 and IA-32 Architectures Optimization Reference Manual p.556:
"In some cases, when the divide or square root operations are part of a larger algorithm that hides some of the latency of these operations, the approximation with Newton-Raphson can slow down execution."
So be careful when using NR in large algorithms. Actually, I had my master's thesis around this point and I will leave a link to it here for future reference, once it is published .
Also for people how always wonder about the throughput and the latency of certain instructions, have a look on IACA. It is a very useful tool provided by Intel to statically analyze the in-core execution performance of codes.
edited here is a link to the thesis for those who are interested thesis
This is because VSQRTPS
(AVX instruction) takes exactly twice as many cycles as SQRTPS
(SSE instruction) on a Sandy Bridge processor. See Agner Fog's optimize guide: instruction tables, page 88.
Instructions like square root and division don't benefit from AVX. On the other hand, additions, multiplications, etc., do.
Depending on your processor hardware, the AVX instructions may be emulated in the hardware as SSE instructions. You'd need to look up your processor's part number to get exact specs on it, but this is one of the main differences between low-end and high-end intel processors, the number of specialize execution units vs. hardware emulation.
If you are interested in increasing square root performance, instead of VSQRTPS you can use VRSQRTPS and Newton-Raphson formula:
x0 = vrsqrtps(a)
x1 = 0.5 * x0 * (3 - (a * x0) * x0)
VRSQRTPS itself doesn't benefit from AVX, but other calculations do.
Use it if 23 bits of precision is enough for you.