I've been using Intel's SSE intrinsics for quite some time with good performance gains. Hence, I expected the AVX intrinsics to further speed up my programs. This, unfortunate
Just for completeness: a Newton-Raphson (NR) implementation of operations such as division or square root is only beneficial if your code contains a limited number of those operations. The reason is that these alternative methods put more pressure on other execution ports, such as the multiplication and addition ports. That is basically why x86 architectures have dedicated hardware units for these operations instead of relying on software alternatives such as NR. I quote from the Intel 64 and IA-32 Architectures Optimization Reference Manual, p. 556:
"In some cases, when the divide or square root operations are part of a larger algorithm that hides some of the latency of these operations, the approximation with Newton-Raphson can slow down execution."
So be careful when using NR in large algorithms. I actually wrote my master's thesis on this topic, and I will leave a link to it here for future reference once it is published.
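To make the port-pressure point concrete, here is a minimal sketch (not taken from the thesis or the manual) of a reciprocal square root computed two ways with SSE intrinsics. The function names `rsqrt_nr` and `rsqrt_exact` are hypothetical, chosen only for this illustration. The NR variant refines the hardware estimate from `_mm_rsqrt_ps` with one step of y1 = 0.5 * y0 * (3 - x * y0^2), which costs four multiplications and one subtraction on top of the estimate; the exact variant goes through the square-root and divide units instead.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: approximate 1/sqrt(x) for four floats using the
   hardware estimate plus one Newton-Raphson step:
   y1 = 0.5 * y0 * (3 - x * y0 * y0). */
static __m128 rsqrt_nr(__m128 x)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    __m128 y0  = _mm_rsqrt_ps(x);                    /* low-precision estimate */
    __m128 xyy = _mm_mul_ps(x, _mm_mul_ps(y0, y0));  /* x * y0^2 */
    /* The extra multiplies and the subtract below are the work that lands
       on the multiplication and addition ports. */
    return _mm_mul_ps(_mm_mul_ps(half, y0), _mm_sub_ps(three, xyy));
}

/* Hypothetical helper: the same quantity computed with the dedicated
   square-root and divide hardware. */
static __m128 rsqrt_exact(__m128 x)
{
    return _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(x));
}
```

Whether `rsqrt_nr` wins depends on the surrounding code: if the loop is already saturating the multiply and add ports, the additional arithmetic can cost more than the divide and square-root latency it avoids, so it is worth measuring both variants in context.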
Also, for people who always wonder about the throughput and latency of certain instructions, have a look at IACA. It is a very useful tool provided by Intel for statically analyzing the in-core execution performance of code.
Edit: here is a link to the thesis for those who are interested: thesis