It is great that GCC 4.8 generates AVX-optimized code with the -Ofast option. However, I found an interesting but stupid bug: it adds additional computations which are
I think what you are seeing in the generated code is one additional Newton-Raphson iteration that refines the initial estimate provided by vrcpps. (See the Intel Intrinsics Guide for details of the accuracy of that initial estimate.)
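To make the refinement step concrete, here is a minimal scalar sketch in C of one Newton-Raphson iteration for a reciprocal, x1 = x0 * (2 - d * x0). It is an illustration of the technique, not the exact instruction sequence GCC emits. The function names `refine_reciprocal` and `relative_error` are hypothetical, and the caller is assumed to supply a coarse estimate `x0` standing in for the result of vrcpps (Intel documents a maximum relative error of 1.5 * 2^-12 for that instruction).

```c
/* One Newton-Raphson step for f(x) = 1/x - d.
 * Given an estimate x0 of 1/d, returns x1 = x0 * (2 - d * x0).
 * If x0 = (1 - e) / d, then x1 = (1 - e*e) / d, so the relative
 * error is roughly squared (ignoring float rounding).
 * Hypothetical helper name, for illustration only. */
float refine_reciprocal(float d, float x0)
{
    return x0 * (2.0f - d * x0);
}

/* Relative error of x as an approximation of 1/d, i.e. |d*x - 1|,
 * evaluated in double so the measurement itself adds little error.
 * Hypothetical helper name, for illustration only. */
double relative_error(float d, float x)
{
    double r = (double)d * (double)x - 1.0;
    return r < 0.0 ? -r : r;
}
```

Starting from an estimate that is off by about 1.5 * 2^-12 (roughly 3.7e-4), a single step should bring the result close to full single precision, which would explain why the compiler spends a few extra multiplies and a subtraction after vrcpps instead of using the raw estimate.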