Row-major vs Column-major confusion

离开以前 2021-01-31 19:27

I've been reading a lot about this, and the more I read, the more confused I get.

My understanding: In row-major rows are stored contiguously in memory, in column-major colu

9 answers
  •  死守一世寂寞
    2021-01-31 19:45

    Ok, so given that the word "confusion" is literally in the title I can understand the level of...confusion.

    Firstly, this absolutely is a real problem.

    Never, EVER succumb to the idea that "it used to be, but... PCs nowadays..."

    The primary issues here are:

    - Cache eviction strategy (LRU, FIFO, etc.), as @Y.C.Jung was beginning to touch on
    - Branch prediction
    - Pipelining (its depth, etc.)
    - Actual physical memory layout
    - Size of memory
    - Architecture of the machine (ARM, MIPS, Intel, AMD, Motorola, etc.)

    This answer will focus on the von Neumann machine (in practice a modified Harvard design, with separate instruction and data caches in front of a shared main memory), as it is most applicable to the current PC.

    The memory hierarchy:

    https://en.wikipedia.org/wiki/File:ComputerMemoryHierarchy.svg

    is a trade-off between cost and speed.

    For today's standard PC system this would be something like:

    SIZE: 500GB HDD > 8GB RAM > L2 Cache > L1 Cache > Registers
    SPEED: 500GB HDD < 8GB RAM < L2 Cache < L1 Cache < Registers

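    This leads to the idea of temporal and spatial locality. Temporal locality means that data you touched recently (code, working set, etc.) is likely to be touched again soon; spatial locality means that data stored physically near what you just touched in "memory" is likely to be touched next.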
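
    To make spatial locality concrete for a C 2D array, here is a minimal sketch that measures how far apart neighbouring elements sit in memory. The array name `grid` and the helper names are made up for illustration; the only thing relied on is that C stores `grid[r][c]` row by row.

    ```c
    #include <stddef.h>
    #include <stdio.h>

    #define NUM_ROWS 1024
    #define NUM_COLS 1024

    /* Hypothetical array, used only to measure element spacing. */
    static int grid[NUM_ROWS][NUM_COLS];

    /* Byte distance from grid[0][0] to its right-hand neighbour grid[0][1]. */
    static ptrdiff_t stride_along_row(void) {
        return (char *)&grid[0][1] - (char *)&grid[0][0];
    }

    /* Byte distance from grid[0][0] to the element below it, grid[1][0]. */
    static ptrdiff_t stride_down_column(void) {
        return (char *)&grid[1][0] - (char *)&grid[0][0];
    }

    int main(void) {
        printf("step along a row:   %td bytes\n", stride_along_row());
        printf("step down a column: %td bytes\n", stride_down_column());
        return 0;
    }
    ```

    A step along a row moves by one `int`; a step down a column moves by a whole row (`NUM_COLS` ints), which is far larger than any typical cache line.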

    Given that "most" of today's PCs are little-endian (Intel) machines, they lay the bytes of each multi-byte value into memory in a specific little-endian order. This does differ from big-endian, fundamentally.

    https://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Data/endian.html (covers it rather... swiftly ;) )
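
    As a rough illustration of what byte order means, the sketch below inspects the bytes of one 32-bit value. The helper name `is_little_endian` is invented for this example. Note that endianness only orders the bytes inside a single value; it does not change the order of elements in an array.

    ```c
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Returns 1 if the least significant byte is stored at the lowest address. */
    static int is_little_endian(void) {
        uint32_t value = 0x01020304u;
        unsigned char bytes[sizeof value];
        memcpy(bytes, &value, sizeof value);  /* copy out the raw byte layout */
        return bytes[0] == 0x04;
    }

    int main(void) {
        printf("this machine is %s-endian\n",
               is_little_endian() ? "little" : "big");
        return 0;
    }
    ```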

    (For the simplicity of this example, I am going to 'say' that things happen in single entries. This is incorrect: entire cache blocks are typically accessed, and their size varies drastically by manufacturer, much less model.)

    So, now that we have that out of the way: suppose, hypothetically, your program demanded 1GB of data from your 500GB HDD, which was loaded into your 8GB of RAM, then into the cache hierarchy, then eventually into registers. If your program read the first entry from your freshest cache line, only to have the second entry your code wants sitting in a different cache line (i.e. in the next ROW instead of the next column), you would have a cache MISS.

    Assuming the cache is full, because it is small, upon a miss, according to the eviction scheme, a line would be evicted to make room for the line that 'does' have the next data you need. If this pattern repeated you would have a MISS on EVERY attempted data retrieval!

    Worse, you would be evicting lines that actually have valid data you are about to need, so you will have to retrieve them AGAIN and AGAIN.

    The term for this is: thrashing

    https://en.wikipedia.org/wiki/Thrashing_(computer_science)

    It can indeed crash a poorly written/error-prone system. (Think Windows BSOD.)

    On the other hand, if you had laid out the data properly (i.e. row-major)... you WOULD still have misses!

    But these misses would only occur at the end of each cache line, not on EVERY attempted retrieval. This can result in orders of magnitude of difference in system and program performance.
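
    The miss pattern described above can be sketched with a toy cache model. Everything here is an assumption chosen for illustration, not a description of any real CPU: a direct-mapped cache of 512 lines, 64 bytes per line, 4-byte elements, and a 1024 x 1024 matrix stored row by row. The function names are hypothetical.

    ```c
    #include <stdio.h>

    #define NUM_ROWS  1024
    #define NUM_COLS  1024
    #define ELEM_SIZE 4    /* assumed element size in bytes */
    #define LINE_SIZE 64   /* assumed cache line size in bytes */
    #define NUM_LINES 512  /* assumed direct-mapped cache: 512 lines = 32 KiB */

    static long tags[NUM_LINES];  /* which memory line each cache slot holds */

    /* Look up one byte address; on a miss, load the line and return 1. */
    static int access_is_miss(long addr) {
        long line = addr / LINE_SIZE;
        long slot = line % NUM_LINES;
        if (tags[slot] == line) return 0;
        tags[slot] = line;  /* evict whatever was in the slot */
        return 1;
    }

    /* Count misses for one full pass; row_wise != 0 walks along rows. */
    static long count_misses(int row_wise) {
        long misses = 0;
        long outer_n = row_wise ? NUM_ROWS : NUM_COLS;
        long inner_n = row_wise ? NUM_COLS : NUM_ROWS;
        for (int k = 0; k < NUM_LINES; k++) tags[k] = -1;  /* empty cache */
        for (long outer = 0; outer < outer_n; outer++) {
            for (long inner = 0; inner < inner_n; inner++) {
                long r = row_wise ? outer : inner;
                long c = row_wise ? inner : outer;
                misses += access_is_miss((r * NUM_COLS + c) * ELEM_SIZE);
            }
        }
        return misses;
    }

    int main(void) {
        printf("row-wise misses:    %ld\n", count_misses(1));
        printf("column-wise misses: %ld\n", count_misses(0));
        return 0;
    }
    ```

    Under these assumptions the row-wise pass misses once per cache line (65,536 misses for 1,048,576 accesses), while the column-wise pass misses on every single access, because each step jumps a full row ahead and keeps evicting lines before they are reused.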

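    Very, very simple snippet (the loop body was cut off in this copy; it is reconstructed here from the surrounding explanation):

    #include <stdio.h>
    
    #define NUM_ROWS 1024
    #define NUM_COLS 1024
    
    int COL_MAJOR [NUM_ROWS][NUM_COLS];
    
    int main (void){
            int i, j;
            /* Outer loop over columns, inner loop over rows: consecutive
               writes are a whole row apart in memory. */
            for(i = 0; i < NUM_COLS; i++){
                    for(j = 0; j < NUM_ROWS; j++){
                            COL_MAJOR[j][i] = i + j; /* note the j, i order */
                    }
            }
            return 0;
    }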

    Now, compile with:

        gcc -g col_maj.c -o col.o

    Now, run with:

        time ./col.o
        real 0m0.009s
        user 0m0.003s
        sys 0m0.004s

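    Now repeat for ROW major (again reconstructed, since the loop body was cut off in this copy):

    #include <stdio.h>
    
    #define NUM_ROWS 1024
    #define NUM_COLS 1024
    
    int ROW_MAJOR [NUM_ROWS][NUM_COLS];
    
    int main (void){
            int i, j;
            /* Outer loop over rows, inner loop over columns: consecutive
               writes are adjacent in memory. */
            for(i = 0; i < NUM_ROWS; i++){
                    for(j = 0; j < NUM_COLS; j++){
                            ROW_MAJOR[i][j] = i + j; /* note the i, j order */
                    }
            }
            return 0;
    }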

    Compile:

        gcc -g row_maj.c -o row.o

    Run:

        time ./row.o
        real 0m0.005s
        user 0m0.001s
        sys 0m0.003s

    Now, as you can see, the row-major one was faster (0.005s versus 0.009s real time).
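
    Timings this small are close to process start-up cost, so here is a hedged sketch of an in-process comparison that times only the loops, using the standard `clock()` function on a larger matrix. The 4096 x 4096 size and the names `matrix` and `fill_seconds` are assumptions for illustration. Results will vary by machine, and an optimizing compiler may reorder the loops, so try it without optimization first.

    ```c
    #include <stdio.h>
    #include <time.h>

    #define NUM_ROWS 4096
    #define NUM_COLS 4096

    /* Hypothetical 64 MiB matrix; static so it does not live on the stack. */
    static int matrix[NUM_ROWS][NUM_COLS];

    /* Fill the matrix and return the CPU seconds the fill took.
       row_wise != 0 walks along rows; otherwise it walks down columns. */
    static double fill_seconds(int row_wise) {
        clock_t start = clock();
        if (row_wise) {
            for (int i = 0; i < NUM_ROWS; i++)
                for (int j = 0; j < NUM_COLS; j++)
                    matrix[i][j] = i + j;
        } else {
            for (int j = 0; j < NUM_COLS; j++)
                for (int i = 0; i < NUM_ROWS; i++)
                    matrix[i][j] = i + j;
        }
        return (double)(clock() - start) / CLOCKS_PER_SEC;
    }

    int main(void) {
        double col = fill_seconds(0);
        double row = fill_seconds(1);
        /* Reading an element back keeps the stores from being discarded. */
        printf("column-wise: %.3fs  row-wise: %.3fs  (check: %d)\n",
               col, row, matrix[NUM_ROWS - 1][NUM_COLS - 1]);
        return 0;
    }
    ```

    Both passes write exactly the same values, so any difference in the reported times comes from the order in which memory is touched.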

    Not convinced? If you would like to see a more drastic example: make the matrix 1000000 x 1000000, initialize it, transpose it, and print it to stdout.

    (Note, on a *NIX system you WILL need to set ulimit unlimited)

    ISSUES with my answer:

    - Optimizing compilers: they change a LOT of things!
    - Type of system
    - Please point any others out
    - This system has an Intel i5 processor
