I recently wrote a Vector 3 class, and I submitted my normalize() function for reviewal to a friend. He said it was good, but that I should multiply by the reciprocal where poss
Think about it in terms of elementary operations that hardware can more easily implement -- add, subtract, shift, compare. Multiplication even in a trivial setup requires fewer such elementary steps -- plus, it afford advances algorithms that are even faster -- see here for example... but hardware generally doesn't take advantage of those (except maybe extremely specialized hardware). For example, as the wikipedia URL says, "Toom–Cook can do a size-N cubed multiplication for the cost of five size-N multiplications" -- that's pretty fast indeed for very large numbers (Fürer's algorithm, a pretty recent development, can do Θ(n ln(n) 2Θ(ln*(n)))
-- again, see the wikipedia page and links therefrom).
Division's just intrisically slower, as -- again -- per wikipedia; even the best algorithms (some of which ARE implemented in HW, just because they're nowhere as sophisticated and complex as the very best algorithms for multiplication;-) can't hold a candle to the multiplication ones.
Just to quantify the issue with not-so-huge numbers, here are some results with gmpy, an easy-to-use Python wrapper around GMP, which tends to have pretty good implementations of arithmetic though not necessarily the latest-and-greatest wheezes. On a slow (first-generation;-) Macbook Pro:
$ python -mtimeit -s'import gmpy as g; a=g.mpf(198792823083408); b=g.mpf(7230824083); ib=1.0/b' 'a*ib'
1000000 loops, best of 3: 0.186 usec per loop
$ python -mtimeit -s'import gmpy as g; a=g.mpf(198792823083408); b=g.mpf(7230824083); ib=1.0/b' 'a/b'
1000000 loops, best of 3: 0.276 usec per loop
As you see, even at this small size (number of bits in the numbers), and with libraries optimized by exactly the same speed-obsessed people, multiplication by the reciprocal can save 1/3 of the time that division takes.
It may be only in rare situations that these few nanoseconds are a life-or-death issue, but, when they are, and of course IF you are repeatedly dividing by the same value (to amortize away the 1.0/b
operation!), then this knowledge can be a life-saver.
(Much in the same vein -- x*x
will often save time compared to x**2
[in languages that have a **
"raise to power" operator, like Python and Fortran] -- and Horner's scheme for polynomial computation is VASTLY preferable to repeated raise-to-power operations!-).
If you think back to grade school, you'll recall that multiplication was harder than addition and division was harder than multiplication. Things aren't any different for the CPU.
Recall also that calculating the reciprocal involves a division, so unless you calculate the reciprocal once and use it three times, you won't see a speed up.
The CPU operation for (float) division is much more complicated than multiplication. The CPU has to do more. I am far from knowledgeable about hardware, but you can find a lot of info about common division implementation (based on newton-raphson algorithms, for example).
I would also be careful about always using multiplication of the reciprocal instead of division to gain CPU performance: they may not give exactly the same results. This may or may not matter depending on your application.