Why is the second loop over a static array in the BSS faster than the first?

隐瞒了意图╮ 2021-01-13 10:37

I have the following code that writes zeros into a global array twice, once forward and once backward.

#include <string.h>
#include <time.h>
#include <stdio.h>
#define SIZE 100000000

char c[SIZE];
char c2[SIZE];

int main()
{
   int i;

   clock_t t = clock();
   for(i = 0; i < SIZE; i++)
       c[i] = 0;
   t = clock() - t;
   printf("%d\n\n", t);

   t = clock();
   for(i = SIZE - 1; i >= 0; i--)
      c[i] = 0;
   t = clock() - t;
   printf("%d\n\n", t);
}

The second (backward) loop is consistently faster than the first one. Why is that?

3 Answers
  • 2021-01-13 11:14

    Following asimes' answer that it's due to caching: I'm not convinced you can enjoy the benefit of caches with a ~100M array; you're likely to thrash out any useful data completely before returning there.

    However, depending on your platform (the OS, mostly), there are other mechanisms at work. When you allocate the arrays you never initialize them, so the first loop probably incurs the penalty of the first access to each 4k page. This usually triggers some kernel assist (a trap into the OS) that comes with high overhead.
    In this case you also modify the page, so most systems would be forced to perform a copy-on-write flow (an optimization that works only as long as you just read from the page), which is even heavier.

    Adding a small access per page before the measured loops (which should be negligible with regard to actual caching, since it only fetches one 64B line out of each 4k page) managed to make the results much more even on my system (although this form of measurement isn't very accurate to begin with); a prefaulting alternative is sketched after this code:

    #include <string.h>
    #include <time.h>
    #include <stdio.h>
    #define SIZE 100000000
    
    char c[SIZE];
    char c2[SIZE];
    
    int main()
    {
       int i;
       for(i = 0; i < SIZE; i+=4096)      ////  access and modify each page once
           c[i] = 0;                      ////
    
       clock_t t = clock();
    
       for(i = 0; i < SIZE; i++)
           c[i] = 0;
    
       t = clock() - t;
       printf("%d\n\n", t);
    
       t = clock(); 
       for(i = SIZE - 1; i >= 0; i--)
          c[i] = 0;
    
       t = clock() - t;
       printf("%d\n\n", t);
    }
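
    A Linux-specific alternative (a sketch of mine, not part of the original answer) is to take the buffer from mmap with MAP_POPULATE, which asks the kernel to fault in every page up front, so neither timed loop pays the first-touch cost:

    #include <stdio.h>
    #include <time.h>
    #include <sys/mman.h>
    
    #define SIZE 100000000
    
    int main(void)
    {
       /* MAP_POPULATE prefaults all pages at mmap time (Linux-specific). */
       char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
       if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    
       long i;
       clock_t t = clock();
       for (i = 0; i < SIZE; i++)        /* forward pass: no page faults left to take */
           buf[i] = 0;
       printf("%ld\n\n", (long)(clock() - t));
    
       t = clock();
       for (i = SIZE - 1; i >= 0; i--)   /* backward pass */
           buf[i] = 0;
       printf("%ld\n\n", (long)(clock() - t));
    
       munmap(buf, SIZE);
       return 0;
    }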
    
  • 2021-01-13 11:25

    If you modify the second loop to be identical to the first, the effect is the same: the second loop is still faster:

    #include <time.h>
    #include <stdio.h>
    #define SIZE 100000000
    
    char c[SIZE];   /* same zero-initialized global array as in the question */
    
    int main() {
       int i;
    
       clock_t t = clock();
       for(i = 0; i < SIZE; i++)
           c[i] = 0;
       t = clock() - t;
       printf("%d\n\n", t);
    
       t = clock(); 
       for(i = 0; i < SIZE; i++)
          c[i] = 0;
       t = clock() - t;
       printf("%d\n\n", t);
    }
    

    This is because the first loop loads the data into the cache, and that data is then readily accessible during the second loop.

    Results of the above:

    317841
    277270
    

    Edit: Leeor brings up a good point: c does not fit in the cache. I have an Intel Core i7 processor: http://ark.intel.com/products/37147/Intel-Core-i7-920-Processor-8M-Cache-2_66-GHz-4_80-GTs-Intel-QPI

    According to that link, the L3 cache is only 8 MB (8,388,608 bytes), whereas c is 100,000,000 bytes.
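
    One way to see a genuine cache effect is to shrink the array so it fits in that 8 MB L3. The following is a sketch of mine (the 4 MB size is illustrative, not from the original answer): with a buffer this small, the second pass can actually reuse cached lines, whereas with the 100 MB array the gap is dominated by first-touch page faults.

    #include <stdio.h>
    #include <time.h>
    
    #define SMALL (4 * 1024 * 1024)      /* 4 MB: fits in an 8 MB L3 */
    char small_buf[SMALL];
    
    int main(void)
    {
       int i;
       clock_t t;
    
       t = clock();
       for (i = 0; i < SMALL; i++)       /* cold: page faults + cache misses */
           small_buf[i] = 0;
       printf("%ld\n\n", (long)(clock() - t));
    
       t = clock();
       for (i = 0; i < SMALL; i++)       /* warm: pages mapped, lines likely still cached */
           small_buf[i] = 0;
       printf("%ld\n\n", (long)(clock() - t));
       return 0;
    }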

  • 2021-01-13 11:34

    When you define global data in C, it is zero-initialized:

    char c[SIZE];
    char c2[SIZE];
    

    In the Linux (Unix) world this means that both c and c2 will be allocated in a special ELF file section, the .bss:

    ... data segment containing statically-allocated variables represented solely by zero-valued bits initially

    The .bss segment exists so that the binary does not have to store all those zeroes; it just records something like "this program wants 200 MB of zeroed memory".

    When your program is loaded, the ELF loader (the kernel in the case of classic static binaries, or the ld.so dynamic loader, also known as the interp) will allocate the memory for .bss, usually with something like an mmap with the MAP_ANONYMOUS flag and a READ+WRITE permissions/protection request.
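
    To make that concrete, here is a minimal sketch of roughly what such an anonymous mapping looks like from user space (my own illustration, not the loader's actual code; the 200 MB size just stands in for c plus c2):

    #include <stdio.h>
    #include <sys/mman.h>
    
    #define BSS_SIZE (200UL * 1024 * 1024)   /* illustrative: roughly c[] + c2[] */
    
    int main(void)
    {
       /* Roughly what the loader does for .bss: ask for anonymous,
          zero-filled, readable+writable memory. No private physical pages
          are handed out yet. */
       char *bss = mmap(NULL, BSS_SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
       if (bss == MAP_FAILED) { perror("mmap"); return 1; }
    
       printf("%d\n", bss[0]);   /* reads are served from the shared zero page: prints 0 */
    
       bss[0] = 1;               /* first write: minor page fault, one real page allocated */
    
       munmap(bss, BSS_SIZE);
       return 0;
    }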

    But the memory manager in the OS kernel will not give you all 200 MB of zeroed physical memory. Instead it marks that part of your process's virtual memory as zero-initialized, and every page of this memory points to a special zero page in physical memory. This page holds 4096 zero bytes, so if you read from c or c2 you will get zero bytes; this mechanism lets the kernel cut down the memory requirements.

    The mappings to the zero page are special; they are marked (in the page table) as read-only. When you first write to any such virtual page, a page-fault exception is generated by the hardware (by the MMU and TLB). This fault is handled by the kernel, in your case by the minor page-fault handler. It allocates one physical page, fills it with zero bytes, and re-points the mapping of the just-accessed virtual page to this physical page. Then it reruns the faulting instruction.
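
    You can already observe this from inside the process with getrusage(), whose ru_minflt field counts minor page faults. The following sketch (my addition, not part of the original measurement) times nothing, it only counts faults around each loop:

    #include <stdio.h>
    #include <sys/resource.h>
    
    #define SIZE 100000000
    char c[SIZE];
    
    static long minor_faults(void)
    {
       struct rusage ru;
       getrusage(RUSAGE_SELF, &ru);
       return ru.ru_minflt;
    }
    
    int main(void)
    {
       long before;
       int i;
    
       before = minor_faults();
       for (i = 0; i < SIZE; i++)        /* first pass: ~one fault per untouched 4k page */
           c[i] = 0;
       printf("first loop:  %ld minor faults\n", minor_faults() - before);
    
       before = minor_faults();
       for (i = SIZE - 1; i >= 0; i--)   /* second pass: pages already backed, ~0 faults */
           c[i] = 0;
       printf("second loop: %ld minor faults\n", minor_faults() - before);
       return 0;
    }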

    I converted your code a bit (each loop is moved into a separate function):

    $ cat b.c
    #include <string.h>
    #include <time.h>
    #include <stdio.h>
    #define SIZE 100000000
    
    char c[SIZE];
    char c2[SIZE];
    
    void FIRST()
    {
       int i;
       for(i = 0; i < SIZE; i++)
           c[i] = 0;
    }
    
    void SECOND()
    {
       int i;
       for(i = 0; i < SIZE; i++)
           c[i] = 0;
    }
    
    
    int main()
    {
       int i;
       clock_t t = clock();
       FIRST();
       t = clock() - t;
       printf("%d\n\n", t);
    
       t = clock(); 
       SECOND();
    
       t = clock() - t;
       printf("%d\n\n", t);
    }
    

    Compile with gcc b.c -fno-inline -O2 -o b, then run it under Linux's perf stat, or the more generic /usr/bin/time, to get the page-fault count:

    $ perf stat ./b
    139599
    
    93283
    
    
     Performance counter stats for './b':
     ....
                24 550 page-faults               #    0,100 M/sec           
    
    
    $ /usr/bin/time ./b
    234246
    
    92754
    
    Command exited with non-zero status 7
    0.18user 0.15system 0:00.34elapsed 99%CPU (0avgtext+0avgdata 98136maxresident)k
    0inputs+8outputs (0major+24576minor)pagefaults 0swaps
    

    So we have about 24.5 thousand minor page faults. With the standard x86/x86_64 page size of 4096 bytes, that is roughly 24,500 × 4096 ≈ 100 megabytes, i.e. about one fault per page of c.

    With the perf record / perf report Linux profiler we can find where the page faults are generated:

    $ perf record -e page-faults ./b
    ...skip some spam from non-root run of perf...
    213322
    
    97841
    
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.018 MB perf.data (~801 samples) ]
    
    $ perf report -n |cat
    ...
    # Samples: 467  of event 'page-faults'
    # Event count (approx.): 24583
    #
    # Overhead       Samples  Command      Shared Object                   Symbol
    # ........  ............  .......  .................  .......................
    #
        98.73%           459        b  b                  [.] FIRST              
         0.81%             1        b  libc-2.19.so       [.] __new_exitfn       
         0.35%             1        b  ld-2.19.so         [.] _dl_map_object_deps
         0.07%             1        b  ld-2.19.so         [.] brk                
         ....
    

    So now we can see that only the FIRST function generates page faults (on the first write to the .bss pages), and SECOND does not generate any. Every page fault corresponds to some work done by the OS kernel, and this work is done only once per page of the .bss (because the .bss is not unmapped and remapped back).
