typedef int array[2][2];

void transpose(array dst, array src) {
    int i, j;
    for (j = 0; j < 2; j++) {
        for (i = 0; i < 2; i++) {
            dst[i][j] = src[j][i];
        }
    }
}
So, in the first case you have two cache lines of 8 bytes each, for a total of 16 bytes. I'll assume an int size of 4 bytes. Given how C lays out arrays and the offsets you've provided, these are the memory lines which can be cached (a quick way to verify the offsets is sketched after the list):
Cacheable lines:
#A: &src[0][0] = 0x00, &src[0][1] = 0x04
#B: &src[1][0] = 0x08, &src[1][1] = 0x0C
#C: &dst[0][0] = 0x10, &dst[0][1] = 0x14
#D: &dst[1][0] = 0x18, &dst[1][1] = 0x1C
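If you want to double-check those offsets, here is a minimal sketch. The within-array offsets are exact for 4-byte ints; the 0x00 and 0x10 base addresses are taken from your question and are an assumption about where the two arrays happen to be placed, not something C guarantees.

#include <stdio.h>

typedef int array[2][2];

int main(void) {
    array src, dst;
    /* Assumed base addresses from the question: src at 0x00, dst at 0x10. */
    unsigned src_base = 0x00, dst_base = 0x10;

    for (int r = 0; r < 2; r++)
        for (int c = 0; c < 2; c++)
            printf("src[%d][%d] = 0x%02X   dst[%d][%d] = 0x%02X\n",
                   r, c, src_base + (unsigned)((char *)&src[r][c] - (char *)src),
                   r, c, dst_base + (unsigned)((char *)&dst[r][c] - (char *)dst));
    return 0;
}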
Then we need to know the order in which the program visits each memory address. I'm assuming the compiler performs no optimizations that might reorder these accesses. (A small simulator for checking the resulting hit/miss pattern is sketched after the trace.)
Access order and cache behavior (assuming initially empty):
#1: load src[0][0] --> Miss line A = cache slot 1
#2: save dst[0][0] --> Miss line C = cache slot 2
#3: load src[0][1] --> Hit line A = cache slot 1
#4: save dst[0][1] --> Hit line C = cache slot 2
#5: load src[1][0] --> Miss line B = cache slot 1 (LRU, replaces line A)
#6: save dst[1][0] --> Miss line D = cache slot 2 (LRU, replaces line C)
#7: load src[1][1] --> Hit line B = cache slot 1
#8: save dst[1][1] --> Hit line D = cache slot 2
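If you'd rather check the hit/miss pattern by machine than by hand, below is a minimal sketch of the model I'm using: a fully associative cache with 8-byte lines and LRU replacement, replaying the eight addresses from the trace above. The line size, the 2-line capacity, and the trace itself are the assumptions stated above, not anything the C code dictates.

#include <stdio.h>

#define LINE_SIZE 8    /* bytes per cache line (assumed) */
#define NUM_LINES 2    /* 2 lines = 16-byte cache; set to 4 for the 32-byte case */

static unsigned tags[NUM_LINES];     /* memory line held by each cache slot */
static unsigned last_use[NUM_LINES]; /* timestamp used for LRU replacement */
static int      valid[NUM_LINES];
static unsigned now;

/* Simulate one access; returns 1 on a hit, 0 on a miss. */
static int access_cache(unsigned addr) {
    unsigned tag = addr / LINE_SIZE;
    int i, victim = 0;

    for (i = 0; i < NUM_LINES; i++)
        if (valid[i] && tags[i] == tag) {  /* hit */
            last_use[i] = ++now;
            return 1;
        }

    /* miss: use an empty slot if there is one, else evict the LRU slot */
    for (i = 0; i < NUM_LINES; i++) {
        if (!valid[i]) { victim = i; break; }
        if (last_use[i] < last_use[victim]) victim = i;
    }
    valid[victim] = 1;
    tags[victim] = tag;
    last_use[victim] = ++now;
    return 0;
}

int main(void) {
    /* The eight accesses from the trace above: src at 0x00, dst at 0x10. */
    unsigned trace[] = { 0x00, 0x10, 0x04, 0x14, 0x08, 0x18, 0x0C, 0x1C };
    int n = sizeof trace / sizeof trace[0];

    for (int i = 0; i < n; i++)
        printf("#%d: 0x%02X -> %s\n", i + 1, trace[i],
               access_cache(trace[i]) ? "hit" : "miss");
    return 0;
}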
That trace, I think, matches your second answer. Now with a cache size of 32 bytes (4 lines), keeping all other factors the same:
Access order and cache behavior (assuming initially empty):
#1: load src[0][0] --> Miss line A = cache slot 1
#2: save dst[0][0] --> Miss line C = cache slot 2
#3: load src[0][1] --> Hit line A = cache slot 1
#4: save dst[0][1] --> Hit line C = cache slot 2
#5: load src[1][0] --> Miss line B = cache slot 3
#6: save dst[1][0] --> Miss line D = cache slot 4
#7: load src[1][1] --> Hit line B = cache slot 3
#8: save dst[1][1] --> Hit line D = cache slot 4
They are identical. The only difference would show up if you ran transpose a second time. With the 16-byte cache you would get exactly the same behavior (well, you'd start with the cache full of all the wrong things, so it might as well be empty). With the larger cache, though, everything the second call needs is already cached, so there would be no misses at all.
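For what it's worth, the simulator sketched above reproduces all of this: with NUM_LINES set to 2 it prints the first table's pattern, with NUM_LINES set to 4 it prints the second, and feeding it the same eight addresses a second time without resetting it shows the difference a second call makes (four misses again for the 16-byte cache, no misses at all for the 32-byte one).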
The difference between my answers and yours is most likely down to our assumptions about what the compiler does with your loop counters (i and j). I would assume they both live in registers (and so never touch the data cache). You may need to assume they are stored in memory somewhere (and therefore interact with the cache) to get the expected results.
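If you want to test that assumption against a real toolchain, compiling with gcc -S (or clang -S) and reading the generated assembly will show whether i and j ever live on the stack; at -O0 they typically do, which is exactly the situation where they would start competing for cache lines, while at -O1 and above they normally stay in registers.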