Today in my computer organization class, the teacher talked about something I found interesting. When it came to explaining why cache memory works, he said that:
There is a very good paper by Ulrich Drepper of Red Hat and glibc fame, What Every Programmer Should Know About Memory. One section discusses caches in great detail. For example, there are cache effects in SMP systems where CPUs can end up thrashing ownership of a modified cache line back and forth, greatly harming performance.
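To make that SMP effect concrete, here is a minimal sketch of my own (not from Drepper's paper): two threads keep writing counters that happen to share a cache line, so ownership of the modified line ping-pongs between the cores. Compile with -pthread and time the run with and without the padding field; the struct layout and the 64-byte line size are assumptions about a typical machine.

/* False-sharing sketch: two threads write counters that share a cache line. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

struct counters {
    volatile long a;        /* written only by thread 1 */
    /* char pad[64]; */     /* uncomment to give each counter its own line */
    volatile long b;        /* written only by thread 2 */
};

static struct counters c;

static void *bump_a(void *arg) { (void)arg; for (long i = 0; i < ITERS; i++) c.a++; return NULL; }
static void *bump_b(void *arg) { (void)arg; for (long i = 0; i < ITERS; i++) c.b++; return NULL; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", c.a, c.b);
    return 0;
}

With the padding uncommented the two counters no longer share a line, and on typical hardware the same run finishes noticeably faster.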
In C, n-dimensional matrices are row major, meaning the last index into the matrix refers to adjacent locations in memory. This is different from some other languages, FORTRAN for example, which are column major. In FORTRAN, it's more efficient to iterate through a 2D matrix like this:
do jj = 1, N
   do ii = 1, M        ! first index varies fastest: contiguous in column-major storage
      x(ii,jj) = x(ii,jj) + K
   end do
end do
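For comparison, a rough C equivalent, assuming the matrix is declared as a row-major double x[M][N] and K is some constant (names carried over from the snippet above), keeps the last index in the inner loop:

/* Row-major C: the last index (j) varies fastest, so memory is walked contiguously. */
for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
        x[i][j] += K;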
Locality of reference. Because the data is stored by rows, the j columns of each row sit at adjacent memory addresses. The memory system typically brings in a whole cache line at a time (and the OS maps memory in whole pages), so references to adjacent addresses are likely to find the data already cached. If instead you increment the row index in the inner loop, successive accesses are a full row of doubles apart and may well land on different pages, so the cache has to constantly bring in and throw away chunks of memory as it references the data. This is called thrashing and is bad for performance.
In practice and with larger, modern caches, the sizes of the rows/columns would need to be reasonably large before this would come into play, but it's still good practice.
[EDIT] The answer above is specific to C and may differ for other languages. The only one that I know is different is FORTRAN. FORTRAN stores things in column major order (the above is row major) and it would be correct to change the order of the statements in FORTRAN. If you want/need efficiency, it's important to know how your language implements data storage.
Cache memory is very fast and very expensive memory that sits close to the CPU. Rather than fetch one small piece of data from RAM each time, the CPU fetches a chunk of data and stores it in the cache. The bet is that if you just read one byte, then the next byte you read is likely to be right after it. If this is the case, then it can come from the cache.
By laying out your loop as you have it, you read the bytes in the order that they are stored in memory. This means that they are in the cache and can be read very quickly by the CPU. If you swapped lines 1 and 2 around, you'd jump ahead "N" bytes each time around the inner loop. The bytes you are reading are no longer consecutive in memory, so they may not be in the cache; the CPU has to fetch them from the (slower) RAM, and your performance decreases.
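To put rough numbers on that, here is a small self-contained sketch (the 4096x4096 size and the use of clock() are my own choices, not from the answer above) that traverses the same matrix in both orders:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096    /* 4096 x 4096 doubles, roughly 128 MB */

int main(void)
{
    double (*x)[N] = calloc(N, sizeof *x);
    if (!x) return 1;

    clock_t t0 = clock();
    for (int i = 0; i < N; i++)         /* row order: consecutive addresses */
        for (int j = 0; j < N; j++)
            x[i][j] += 1.0;

    clock_t t1 = clock();
    for (int j = 0; j < N; j++)         /* column order: jumps a whole row per access */
        for (int i = 0; i < N; i++)
            x[i][j] += 1.0;

    clock_t t2 = clock();
    printf("row order:    %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("column order: %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(x);
    return 0;
}

On a typical machine the second pass is several times slower, even though it performs exactly the same arithmetic.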
It is like that because caches like locality. The same number of memory accesses, spaced further apart, will hit different "lines" of cache, or might even miss the cache altogether. It is therefore good, whenever you have the choice, to organize data so that accesses that are likely to happen close to each other in time also do so in space. This increases the chance of a cache hit and gives you more performance.
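As a concrete, hypothetical example of organizing data that way: if a loop only ever reads one field of a large record, storing that field in its own contiguous array means every cache line brought in is fully used, rather than mostly filled with fields the loop never touches. The particle layout below is made up purely for illustration:

#include <stddef.h>

/* Array of structs: summing only "weight" strides over whole 56-byte records,
 * dragging the unused position/velocity fields into the cache as well. */
struct particle {
    double pos[3];
    double vel[3];
    double weight;
};

double total_weight_aos(const struct particle *p, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += p[i].weight;       /* one useful double per record fetched */
    return sum;
}

/* Struct of arrays: the weights are contiguous, so accesses that are close
 * in time are also close in space, which is exactly what the cache rewards. */
struct particles {
    double *pos;       /* 3*n doubles */
    double *vel;       /* 3*n doubles */
    double *weight;    /* n doubles   */
};

double total_weight_soa(const struct particles *p, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += p->weight[i];
    return sum;
}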
There is of course a wealth of information about this topic available; see for instance this Wikipedia entry on locality of reference. Or, I guess, your own course textbook. :)