Below are two programs that are almost identical except that I switched the i and j variables around. They both run in different amounts of time.
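This has nothing to do with assembly. The difference is due to cache misses.
C multidimensional arrays are stored in row-major order, so the last index is the one that varies fastest in memory: a[i][j] and a[i][j+1] are adjacent, while a[i][j] and a[i+1][j] are a whole row apart. The first version strides through memory a row at a time, so for an array larger than the cache it can miss on nearly every access, whereas the second version walks memory sequentially and uses every element of each cache line it loads. So the second version should be substantially faster.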
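Since the two programs are not reproduced here, the following is only a minimal sketch of the two access patterns, not the original code. The array name `grid`, its 1024 x 1024 size, and the function names are all hypothetical and chosen for illustration.

```c
#include <stddef.h>

#define ROWS 1024   /* assumed size, not taken from the original programs */
#define COLS 1024

/* Hypothetical test matrix; static so it does not live on the stack. */
static int grid[ROWS][COLS];

/* Fill every element with a known value. */
void fill_grid(int value)
{
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            grid[i][j] = value;
}

/* Inner loop walks the last index: consecutive accesses are adjacent in memory. */
long sum_last_index_inner(void)
{
    long total = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            total += grid[i][j];
    return total;
}

/* Inner loop walks the first index: consecutive accesses are
   COLS * sizeof(int) bytes apart, so each one tends to touch a new cache line. */
long sum_first_index_inner(void)
{
    long total = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            total += grid[i][j];
    return total;
}
```

Both functions return the same sum; only the order in which memory is touched differs. To see the effect, call each one from main and time it, for example with clock() from <time.h>. The size of the gap depends on the array size and the machine, and an optimizing compiler may interchange the loops itself, so the numbers can vary.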
See also: http://en.wikipedia.org/wiki/Loop_interchange.