I have a large piece of code, part of which is this expression:
result = (nx * m_Lx + ny * m_Ly + m_Lz) / sqrt(nx * nx + ny * ny + 1);
My guess is that, when using the FPU, the processor has time to compute the first multiplication while the next values are still being loaded, whereas the SSE version has to load all of the values before it can start computing.
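To make the comparison concrete, here is a rough sketch of the two forms I have in mind. The function names `eval_scalar` and `eval_sse`, the `float` types, and passing `m_Lx`, `m_Ly`, `m_Lz` as parameters are all assumptions for illustration, not the actual code. It only shows the structure (the packed version fills its registers before any multiply is issued); it does not measure anything or confirm the guess above.

```cpp
#include <cmath>

// Use SSE intrinsics only where the compiler advertises them.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // SSE1 intrinsics
#define HAS_SSE 1
#else
#define HAS_SSE 0
#endif

// Scalar form of the expression from the question.
// Assumption: float operands; the original types are not shown.
float eval_scalar(float nx, float ny, float Lx, float Ly, float Lz) {
    return (nx * Lx + ny * Ly + Lz) / std::sqrt(nx * nx + ny * ny + 1.0f);
}

// Hypothetical packed form: both operand vectors are built first,
// and only then are the two packed multiplies issued.
float eval_sse(float nx, float ny, float Lx, float Ly, float Lz) {
#if HAS_SSE
    const __m128 n = _mm_set_ps(0.0f, 1.0f, ny, nx);  // lanes 0..3: nx, ny, 1, 0
    const __m128 l = _mm_set_ps(0.0f, Lz, Ly, Lx);    // lanes 0..3: Lx, Ly, Lz, 0

    float dot[4];
    float sq[4];
    _mm_storeu_ps(dot, _mm_mul_ps(n, l));  // nx*Lx, ny*Ly, Lz, 0
    _mm_storeu_ps(sq, _mm_mul_ps(n, n));   // nx*nx, ny*ny, 1, 0

    // Horizontal sums done in scalar code to keep the sketch short.
    const float num = dot[0] + dot[1] + dot[2];
    const float den = sq[0] + sq[1] + sq[2];
    return num / std::sqrt(den);
#else
    // No SSE available on this target: fall back to the scalar form.
    return eval_scalar(nx, ny, Lx, Ly, Lz);
#endif
}
```

Both functions should agree up to rounding; for example, with `nx = ny = 2` and `(Lx, Ly, Lz) = (1, 2, 3)` the numerator is 9 and the square root is 3, so each returns 3.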