I just stumbled upon this blog post. The author shows two code samples that loop through a rectangle and compute something (my guess is the computing code is just a placeholder).
Yeah, 'cache locality'... of course it depends; you could optimize memory allocation for vertical scans. Traditionally, video memory is allocated left-to-right, top-to-bottom, going back, I'm sure, to the days of CRT screens, which drew scanlines the same way. In theory you could change this, though. All this to say: there isn't anything intrinsic about the horizontal method.
An answer has been accepted, but I don't think it's the whole story.
Yes, cache is a big part of the reason. All those elements have to be stored in memory in some order, and if you index through them in the order they are stored, you are likely to have fewer cache misses. Likely.
The other issue (also mentioned by a lot of answers) is that pretty much every processor has a very fast integer increment instruction. They do not typically have a very fast "increment by one amount multiplied by a second arbitrary amount" instruction. That's what you are asking for when you index "against the grain".
A third issue is optimization. A lot of effort and research has been put into optimizing loops of this kind, and your compiler will be much more likely to be able to put one of those optimizations into effect if you index through the array in some reasonable kind of order.
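To make the "grain" concrete, here is a minimal sketch of the two scan orders over the same row-major array (grid, WIDTH, HEIGHT, and the function names are made up for illustration):

    /* Minimal sketch: the same 2D traversal "with" and "against" the grain.
       The array is row-major, as C arrays are: each grid[row] is contiguous. */
    enum { HEIGHT = 1024, WIDTH = 1024 };
    static int grid[HEIGHT][WIDTH];

    /* With the grain: the inner loop touches consecutive addresses, so the
       address update is a plain increment, cache lines are used fully, and
       the compiler can readily unroll or vectorize the inner loop. */
    long sum_with_grain(void) {
        long sum = 0;
        for (int row = 0; row < HEIGHT; row++)
            for (int col = 0; col < WIDTH; col++)
                sum += grid[row][col];
        return sum;
    }

    /* Against the grain: the inner loop jumps a whole row (WIDTH elements)
       per step, so each access implies stride arithmetic and, for a large
       enough grid, a cache miss. */
    long sum_against_grain(void) {
        long sum = 0;
        for (int col = 0; col < WIDTH; col++)
            for (int row = 0; row < HEIGHT; row++)
                sum += grid[row][col];
        return sum;
    }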
Cache is indeed the reason, but if you want the meat of the argument, take a look at "What Every Programmer Should Know About Memory" by U. Drepper:
http://people.redhat.com/drepper/cpumemory.pdf
The reason is that there is really no such thing as a 2-dimensional array once you get down to the hardware level of how memory is laid out. For a 2D array indexed as (row, column), each access has to be translated into a single-dimensional array[index], because memory in a computer is linear.
So if you're scanning vertically, the next index is calculated as:
    index = row * numColumns + col;
However, if you're scanning horizontally, then the next index is simply:
    index++;
A single addition is fewer opcodes for the CPU than a multiplication AND an addition, and thus horizontal scanning is faster because of the architecture of computer memory.
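Written with explicit pointers, the difference in per-step work looks like this (a sketch; array, numRows, and numColumns are illustrative names, not code from the post):

    /* Per-step arithmetic for the two scan orders over a flattened
       numRows x numColumns array. */
    long sum_all(const int *array, int numRows, int numColumns, int vertical) {
        long total = 0;
        const int *end = array + (long)numRows * numColumns;
        if (!vertical) {
            /* Horizontal: the next element is the next address; one increment. */
            for (const int *p = array; p < end; p++)
                total += *p;
        } else {
            /* Vertical: each step must skip an entire row, and each column
               restarts from a new offset: row * numColumns + col in disguise. */
            for (int col = 0; col < numColumns; col++)
                for (const int *p = array + col; p < end; p += numColumns)
                    total += *p;
        }
        return total;
    }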
Cache is not the answer because, if this is the first time you're loading this data, everything still has to be fetched from RAM at least once. For the very first execution, horizontal is faster because there are fewer operations. Subsequent loops through the rectangle will be made faster by the cache, and vertical scanning could be slower still because of cache misses if the rectangle is sufficiently large, but it will always be slower than horizontal scanning because of the increased number of operations needed to access the next element.
To expand on the previous answers a bit:
Usually, as programmers, we can think of our programs' addressable memory as a flat array of bytes, from 0x00000000 to 0xFFFFFFFF. The operating system will reserve some of those addresses (all the ones lower than 0x80000000, say) for its own use, but we can do what we like with the others. All those memory locations live in the computer's RAM, and when we want to read from them or write to them, we issue the appropriate instructions.
But this isn't true! There are a bunch of complications tainting that simple model of process memory: virtual memory, swapping, and the cache.
Talking to RAM takes a fairly long time. It's much faster than going to the hard disk, as there aren't any spinning platters or magnets involved, but it's still pretty slow by the standards of a modern CPU. So, when you try to read from a particular location in memory, your CPU doesn't just read that one location into a register and call it good. Instead, it reads that location, /and a bunch of nearby locations/, into a processor cache that lives on the CPU and can be accessed much more quickly than main memory.
Now we have a more complicated, but more correct, view of the computer's behavior. When we try to read a location in memory, first we look in the processor cache to see if the value at that location is already stored there. If it is, we use the value in the cache. If it isn't, we take a longer trip into main memory, retrieve the value as well as several of its neighbors and stick them in the cache, kicking out some of what used to be there to make room.
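For reference, the two snippets under discussion look roughly like this (a reconstruction for illustration; the exact code, element type, and names are in lbrandy's post):

    /* Reconstruction of the two snippets: three width * height arrays,
       stored row-major, combined either column-by-column or row-by-row. */

    /* First snippet: the inner loop moves vertically, striding by width. */
    void add_vertical(float *c, const float *a, const float *b,
                      int width, int height) {
        for (int x = 0; x < width; x++)
            for (int y = 0; y < height; y++)
                c[y * width + x] = a[y * width + x] + b[y * width + x];
    }

    /* Second snippet: the inner loop moves horizontally, striding by 1. */
    void add_horizontal(float *c, const float *a, const float *b,
                        int width, int height) {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                c[y * width + x] = a[y * width + x] + b[y * width + x];
    }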
Now we can see why the second code snippet is faster than the first. In the second example, we first access a[0], b[0], and c[0]. Each of those values is cached, along with their neighbors, say a[1..7], b[1..7], and c[1..7]. Then when we access a[1], b[1], and c[1], they're already in the cache and we can read them quickly. Eventually we get to a[8] and have to go to RAM again, but seven times out of eight we're using nice fast cache memory instead of clunky slow RAM memory.
(So why don't accesses to a, b, and c kick each other out of the cache? It's a bit complicated, but essentially the processor decides where to store a given value in the cache by its address, so three objects that aren't near each other spatially are unlikely to be cached into the same location.)
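A sketch of that address-to-slot mapping, with made-up example parameters (real line sizes and set counts vary by CPU):

    /* How a set-associative cache typically picks a set for an address.
       LINE_SIZE and NUM_SETS are illustrative values, not any specific CPU. */
    enum { LINE_SIZE = 64, NUM_SETS = 512 };

    unsigned cache_set(unsigned long address) {
        unsigned long line = address / LINE_SIZE;  /* which cache line */
        return (unsigned)(line % NUM_SETS);        /* which set it lands in */
    }

Since a, b, and c start at different addresses, their lines usually map to different sets, and associativity lets a few colliding lines coexist even when they don't.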
By contrast, consider the first snippet from lbrandy's post. We first read a[0], b[0], and c[0], caching a[1..7], b[1..7], and c[1..7]. Then we access a[width], b[width], and c[width]. Assuming width is >= 8 (which it probably is, or else we wouldn't care about this sort of low-level optimization), we have to go to RAM again, caching a new set of values. By the time we get to a[1], it will probably have been kicked out of the cache to make room for something else. In the not-at-all-uncommon case of a trio of arrays that are larger than the processor cache, it's likely that /every single read/ will miss the cache, degrading performance enormously.
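To see that cliff directly, a quick-and-dirty timing harness along these lines works (a sketch: the sizes, the float element type, and the use of clock() are arbitrary choices, picked so the arrays overflow a typical cache; a real benchmark would also warm up the arrays and defeat the optimizer):

    /* Rough benchmark: time both traversal orders over arrays large
       enough to blow out the cache. W and H are arbitrary example sizes. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    enum { W = 4096, H = 4096 };

    int main(void) {
        float *a = calloc((size_t)W * H, sizeof(float));
        float *b = calloc((size_t)W * H, sizeof(float));
        float *c = calloc((size_t)W * H, sizeof(float));
        if (!a || !b || !c) return 1;

        clock_t t0 = clock();
        for (int y = 0; y < H; y++)            /* horizontal: stride 1 */
            for (int x = 0; x < W; x++)
                c[y * W + x] = a[y * W + x] + b[y * W + x];
        clock_t t1 = clock();
        for (int x = 0; x < W; x++)            /* vertical: stride W */
            for (int y = 0; y < H; y++)
                c[y * W + x] = a[y * W + x] + b[y * W + x];
        clock_t t2 = clock();

        printf("horizontal: %.3fs\nvertical:   %.3fs\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC);
        free(a); free(b); free(c);
        return 0;
    }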
This has been a very high-level discussion of modern caching behavior. For something more in-depth and technical, Drepper's "What Every Programmer Should Know About Memory" (linked in an earlier answer) is a thorough yet readable treatment of the subject.
Cache locality. When you scan horizontally, your data will be closer together in memory, so you will have fewer cache misses and thus faster performance. For a small enough rectangle, this won't matter.