Memory hardware is not optimized to deliver individual addresses: instead, it tends to operate on larger chunks of contiguous memory called cache lines. Every time you read one entry of your matrix, the entire cache line it lies in gets loaded into cache along with it.
The faster loop ordering is set up to read memory in order: each time you load a cache line, you use all of the entries in that cache line, and each pass through the outer loop reads each matrix entry only a single time.
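For concreteness, here is a sketch of what the cache-friendly traversal looks like (the names `sum_row_major`, `n`, and `m` are illustrative placeholders, not taken from your snippet):

```c
#include <stddef.h>

/* Row-major traversal: the inner loop walks each row left to right,
 * so consecutive iterations touch consecutive addresses and every
 * entry of a loaded cache line gets used. */
double sum_row_major(size_t n, const double m[n][n])
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)       /* rows */
        for (size_t j = 0; j < n; j++)   /* columns: unit stride */
            sum += m[i][j];
    return sum;
}
```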
The slower loop ordering, however, uses only a single entry from each cache line before moving on, so each cache line has to be loaded multiple times: once for each matrix entry it holds (assuming the matrix is too large for the lines to stay cached between uses). For example, if a double is 8 bytes and a cache line is 64 bytes long, each line holds eight entries, and each pass through the outer loop has to fetch every matrix entry from memory eight times rather than once.
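The slow version differs only in which loop is on the inside (same placeholder names as above):

```c
#include <stddef.h>

/* Column-major traversal: the inner loop jumps n doubles at a time,
 * touching one entry per cache line. With 8-byte doubles and 64-byte
 * lines, a matrix too large to fit in cache forces each line to be
 * fetched eight times instead of once. */
double sum_column_major(size_t n, const double m[n][n])
{
    double sum = 0.0;
    for (size_t j = 0; j < n; j++)       /* columns */
        for (size_t i = 0; i < n; i++)   /* rows: stride of n doubles */
            sum += m[i][j];
    return sum;
}
```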
All that said, if you had turned optimizations on, you would probably see no difference: optimizers understand this phenomenon, and good ones can recognize that, for this particular snippet, they are free to swap which loop is the inner one and which is the outer (an optimization known as loop interchange).
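If you want to check this yourself, a minimal timing harness along these lines will do (`N` and the array name `a` are my choices; whether the gap closes under optimization depends on how aggressive the compiler's loop-interchange pass is, which GCC, for instance, performs via -floop-interchange at -O3):

```c
#include <stdio.h>
#include <time.h>

#define N 4096  /* 4096 x 4096 doubles = 128 MB, far larger than any cache */

static double a[N][N];

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;

    clock_t t0 = clock();
    double s1 = 0.0;
    for (int i = 0; i < N; i++)          /* fast: unit stride */
        for (int j = 0; j < N; j++)
            s1 += a[i][j];
    clock_t t1 = clock();

    double s2 = 0.0;
    for (int j = 0; j < N; j++)          /* slow: stride of N doubles */
        for (int i = 0; i < N; i++)
            s2 += a[i][j];
    clock_t t2 = clock();

    /* Printing the sums keeps the loops from being optimized away. */
    printf("row-major:    %.3f s (sum %g)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, s1);
    printf("column-major: %.3f s (sum %g)\n", (double)(t2 - t1) / CLOCKS_PER_SEC, s2);
    return 0;
}
```

Compile it once without optimization and once with it enabled (e.g. cc -O0 vs. cc -O2) and compare the two timings in each build.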
(Also, a good optimizer would make only one pass through the outermost loop, because it can recognize that the first 999 passes are irrelevant to the final value of sum.)
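That last remark is easier to see against a hypothetical reconstruction of the benchmark shape it describes (`REPS`, `n`, and `m` are stand-ins, not your actual code):

```c
#include <stddef.h>

enum { REPS = 1000 };  /* stand-in for the 1000-pass outermost loop */

double benchmark(size_t n, const double m[n][n])
{
    double sum = 0.0;
    for (int rep = 0; rep < REPS; rep++) {
        sum = 0.0;                       /* each pass discards the previous result */
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++)
                sum += m[i][j];
    }
    /* Only the final pass contributes to sum, so the optimizer is free
     * to run the rep loop just once. */
    return sum;
}
```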