The difference in performance is caused by the way the CPU cache interacts with the memory layout of the array.
The two-dimensional array matrix[i][j] is represented in memory as one long list of values.
E.g., the array A[3][4] looks like this:
1 1 1 1 2 2 2 2 3 3 3 3
In this example, every entry of A[0][x] is set to 1, every entry of A[1][x] is set to 2, and so on.
If your first loop is applied to this matrix, the memory positions are visited in this order (each number is the step at which that position is accessed):
1 2 3 4 5 6 7 8 9 10 11 12
For the second loop, the step at which each memory position is accessed looks like this:
1 4 7 10 2 5 8 11 3 6 9 12
When the program accesses an element of the array, the CPU loads a whole cache line, so the elements that follow it in memory are loaded as well.
E.g., if you access A[0][1], then A[0][2] and A[0][3] are loaded too.
As a result, the first loop needs fewer load operations, because some elements are already in the cache when they are needed.
The second loop loads entries into the cache that are not needed at the time and that may be evicted again before the loop reaches them, resulting in more load operations.