Cache Memory Optimization Array Transpose: C

前端 未结 1 1793
鱼传尺愫
鱼传尺愫 2021-02-08 22:55
typedef int array[2][2];

void transpose(array dst, array src) {
    int i, j;
    for (j = 0; j < 2; j++) {
        for (i = 0; i < 2; i++) {
            dst[i][j         


        
1条回答
  •  粉色の甜心
    2021-02-08 23:20

    So, in the first instance you have two cache lines of 8 bytes each for a total of 16 bytes. I'll assume an int data size of 4 bytes. Given the placement of arrays in C and the offsets you've provided these are the memory lines which can be cached:

    Cacheable lines:
    #A: &src[0][0] = 0x00, &src[0][1] = 0x04
    #B: &src[1][0] = 0x08, &src[1][1] = 0x0C
    #C: &dst[0][0] = 0x10, &dst[0][1] = 0x14
    #D: &dst[1][0] = 0x18, &dst[1][1] = 0x1C
    

    Then we need to know the access order that each memory address is visited by the program. I'm assuming no optimizations which might cause reorderings by the compiler.

    Access order and cache behavior (assuming initially empty):
    #1: load src[0][0] --> Miss line A = cache slot 1
    #2: save dst[0][0] --> Miss line C = cache slot 2
    #3: load src[0][1] --> Hit  line A = cache slot 1
    #4: save dst[0][1] --> Hit  line C = cache slot 2
    #5: load src[1][0] --> Miss line B = cache slot 1 (LRU, replaces line A)
    #6: save dst[1][0] --> Miss line D = cache slot 2 (LRU, replaces line C)
    #7: load src[1][1] --> Hit  line B = cache slot 1
    #8: save dst[1][1] --> Hit  line D = cache slot 2
    

    Which, I think, matches your second answer. Then with a cache size of 32 bytes (4 lines), assuming all other factors are constant:

    Access order and cache behavior (assuming initially empty):
    #1: load src[0][0] --> Miss line A = cache slot 1
    #2: save dst[0][0] --> Miss line C = cache slot 2
    #3: load src[0][1] --> Hit  line A = cache slot 1
    #4: save dst[0][1] --> Hit  line C = cache slot 2
    #5: load src[1][0] --> Miss line B = cache slot 3
    #6: save dst[1][0] --> Miss line D = cache slot 4
    #7: load src[1][1] --> Hit  line B = cache slot 3
    #8: save dst[1][1] --> Hit  line D = cache slot 4
    

    They are identical. The only difference would be if you reran transpose again. In case 1 you would get the exact same behavior (well, you'ld start with the cache full of all the wrong things, so it might as well be empty). In the larger cache case, though, everything you need for the second call is already cached, so there will be no cache misses.

    The difference between my answers and yours is most likely due to our assumptions about the behavior of the compiler for your loop count registers (i and j). I would assume they are both stored in registers (and so would not affect the data cache). You may need to assume they are in memory somewhere (and therefore interact with the cache) to get the expected results.

    0 讨论(0)
提交回复
热议问题