Why is iterating through `std::vector` faster than iterating through `std::array`?

攒了一身酷 · asked 2020-12-04 00:56

I recently asked this question: Why is iterating an std::array much faster than iterating an std::vector?

As people quickly pointed out, my benchmark had many flaws.

2 Answers
  • 2020-12-04 01:34

    The difference is due to the array's memory pages not being resident in the process address space (the global-scope array is stored in the .bss section of the executable, which hasn't been paged in; it is zero-initialized). The vector, by contrast, has just been allocated and zero-filled, so its memory pages are already present.

    If you add

    std::fill_n(v.data(), n, 1); // included in <algorithm>
    

    as the first line of main to bring the pages in (pre-fault), that makes array time the same as that of vector.


    On Linux, instead of that, you can do mlock(v.data(), v.size() * sizeof(v[0])); to bring the pages into the address space. See man mlock for full details.

  • 2020-12-04 01:37

    Memory mapping/allocating is lazy: the first access to a page will cause a page fault exception (#PF on x86). This includes the BSS, as well as file-backed mappings like the text segment of your executable. These page faults are "valid" so they don't result in a SIGSEGV being delivered; instead the kernel allocates a physical page if necessary and wires up the hardware page tables so the load or store can rerun and not fault the 2nd time.

    This is expensive, especially if the kernel doesn't "fault-around" and prepare multiple pages during one page fault. (Especially with Spectre + Meltdown mitigation enabled making user<->kernel round trips more expensive on current x86-64 hardware.)

    You're letting std::vector's constructor write zeros to the array after dynamic allocation¹, so std::vector does all its page-faulting outside your timed loop. This happens before main, while the implementation is running constructors for static objects.

    But the array is zero-initialized so it gets placed in the BSS. The first thing to touch it is your loop. Your array<> loop pays for all the page faults inside the timed region.

    If you used new int[n] to dynamically allocate but not initialize a block of memory, you'd see the same behaviour as from your static array<>. (Maybe slightly better if Linux is more willing to use transparent hugepages for a dynamic allocation instead of the BSS mapping.)
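    The distinction between the two spellings of `new` can be sketched as follows (the helper names are made up for illustration):

```cpp
#include <cstddef>

// Default-initialized: int elements are left uninitialized, so no page is
// touched until your own loop reads or writes them -- the soft page faults
// land inside the timed region, like the static array<> case.
int* alloc_lazy(std::size_t n) { return new int[n]; }

// Value-initialized: the zero-fill touches every page up front, paying the
// page faults before the benchmark loop starts, like std::vector<int>(n).
int* alloc_prefaulted(std::size_t n) { return new int[n](); }
```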



    Footnote 1 std::vector in libstdc++ and libc++ is too stupid to take advantage of getting already-zeroed pages from the OS, like it could if it used calloc or equivalent. It would be possible if the library provided a new/delete-compatible allocator for zeroed memory.

    C++ new/delete is crippled vs. malloc/free/calloc/realloc. I have no idea why ISO C++ left out calloc and realloc: both are very useful for large allocations, especially realloc for resizing a std::vector of trivially-copyable objects that might have room to grow its mapping without copying. But since new/delete aren't guaranteed to be compatible with malloc/free, and new is replaceable, libraries can't very easily use calloc and realloc even under the hood.
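    The allocation side of such a zeroed-memory allocator could look roughly like this (a hypothetical sketch, not a real library facility; note that `std::vector` would still value-initialize the elements unless the library also knew it could skip the zero-fill for trivially-constructible types):

```cpp
#include <cstdlib>
#include <new>

// Hypothetical allocator that gets already-zeroed pages from the OS via
// calloc, so a vector using it would not need to write the zeros itself.
template <class T>
struct zeroed_alloc {
    using value_type = T;
    zeroed_alloc() = default;
    template <class U> zeroed_alloc(const zeroed_alloc<U>&) {}

    T* allocate(std::size_t n) {
        // calloc checks n * sizeof(T) for overflow and returns zeroed memory.
        if (void* p = std::calloc(n, sizeof(T)))
            return static_cast<T*>(p);
        throw std::bad_alloc();
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
};
template <class T, class U>
bool operator==(const zeroed_alloc<T>&, const zeroed_alloc<U>&) { return true; }
template <class T, class U>
bool operator!=(const zeroed_alloc<T>&, const zeroed_alloc<U>&) { return false; }
```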


    Another factor: read-only leaves pages CoW mapped to the same physical zero page

    When lazy allocation is triggered by a read (instead of write), it reads as zero. (BSS pages read as zero, new pages from mmap(MAP_ANONYMOUS) read as all-zero.)

    The (soft) page fault handler that wired up the HW page table didn't need to actually allocate a physical page aka page-frame to back that virtual page. Instead, Linux maps clean (unwritten) anonymous pages to a single physical zeroed page. (This applies across all tasks.)

    If we make multiple passes over the array, this leads to the curious situation where we can get TLB misses but L1d or L3 hits (depending on hugepage or not) because we have multiple virtual pages pointing to the same physical location.

    (Some CPUs, e.g. AMD Ryzen, use micro-tagging in the L1d cache to save power, at the cost of the cache only being able to hit for one virtual address even if the same memory is mapped to multiple virtual addresses. Intel CPUs use true VIPT L1d caches and really can get this effect.)

    I made a test program for Linux that will use madvise(MADV_HUGEPAGE) (to encourage the kernel to defrag memory for hugepages) or madvise(MADV_NOHUGEPAGE) (to disable hugepages even for the read-only case).

    For some reason Linux BSS pages don't use hugepages when you write them; only reading them uses 2M hugepages (the single 2M zero page backing them is too big for L1d or L2, but fits in L3, and we get essentially all TLB hits). It's hard to see this in /proc/PID/smaps because unwritten memory doesn't show up as "resident" at all. (Remember it's physically backed by a system-wide shared page of zeroes.)

    I made some changes to your benchmark code to rerun the sum loop multiple times after an init pass that either reads or writes the array, according to command-line args. The repeat-loop makes it run longer so we can get more precise timing, and to amortize the init so we get useful results from perf.

    #include <vector>
    #include <array>
    #include <stdio.h>
    #include <chrono>
    #include <sys/mman.h>
    
    using namespace std;
    
    constexpr int n = 100'000'000;
    //vector<int> v(n);
    alignas(4096) array<int, n> v;
    
    //template<class T>
    __attribute__((noinline))
    int toucharray(volatile int *vv, int write_init) {
            int res=vv[0];
            for(int i=32 ; i<n ; i+=128)
                    if(write_init)
                        vv[i] = 0;
                    else
                        res += vv[i];
    //      volatile int sum = res;  // noinline is fine, we don't need to stop multiple calls from CSEing
            return res;
    }
    
    template <class T>
    __attribute__((noinline,noclone))
    int sum_container(T &vv) {
        unsigned int res=0;
        for(int x : vv)
            res += x;
        __attribute__((used)) static volatile int sink;
        sink = res;  // a side-effect stops IPA from deciding that this is a pure function
        return res;
    }
    
    int main(int argc, char**argv)
    {
        int write_init = 0;
        int hugepage = 0;
        if (argc>1) {
                hugepage = argv[1][0] & 1;
                write_init = argv[1][0] & 2;
        }
        int repcount = 1000;
        if (argc>2)
                repcount = atoi(argv[2]);
    
    // TODO: option for no madvise.
        madvise(v.data(), n*sizeof(v[0]), MADV_SEQUENTIAL);
        madvise(v.data(), n*sizeof(v[0]), hugepage ? MADV_HUGEPAGE : MADV_NOHUGEPAGE);  
        madvise(v.data(), n*sizeof(v[0]), MADV_WILLNEED); 
     // SEQ and WILLNEED probably only matter for file-backed mappings to reduce hard page faults.
     //  Probably not encouraging faultahead / around for lazy-allocation soft page fault
    
        toucharray(v.data(), write_init);
    
        int res = 0;
        auto start = chrono::steady_clock::now();
        for(int i=0; i<repcount ; i++)
            res = sum_container(v);
        auto end = chrono::steady_clock::now();
        double elapsed =
            std::chrono::duration_cast<
                std::chrono::duration<double, std::milli>
            >(end - start).count();
        printf("result: %d\ntime: %f\n", res, elapsed);
    }
    

    best case: clang++ -O3 -march=native (skylake) actually unrolls with multiple accumulators, unlike gcc -funroll-loops which does a silly job.

    On my Skylake i7-6700k with DDR4-2666 DRAM, configured for 4.2GHz max turbo and governor=performance -

    # using std::array<int,n>
    # 0&1 = 0 -> MADV_NOHUGEPAGE.  0&2 = 0 -> read-only init
    taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 0 1000
    result: 0
    time: 1961.952394
    
     Performance counter stats for './touchpage-array-madv-nohuge-argc.clang 0 1000':
    
              2,017.34 msec task-clock:u              #    1.000 CPUs utilized          
                    50      context-switches          #    0.025 K/sec                  
                     0      cpu-migrations            #    0.000 K/sec                  
                97,774      page-faults               #    0.048 M/sec                  
         8,287,680,837      cycles                    #    4.108 GHz                    
        14,500,762,859      instructions              #    1.75  insn per cycle         
                13,688      mem_load_retired.l2_hit:u #    0.007 M/sec                  
        12,501,329,912      mem_load_retired.l1_hit:u # 6196.927 M/sec                  
               144,559      mem_inst_retired.stlb_miss_loads:u #    0.072 M/sec                  
    
           2.017765632 seconds time elapsed
    
           1.979410000 seconds user
           0.036659000 seconds sys
    

    Notice considerable TLB misses (mem_inst_retired.stlb_miss_loads:u counts 2nd-level TLB misses in user-space). And 97k page faults. That's pretty much exactly as many 4k pages as it takes to cover the 100M * 4 = 400MB array, so we got 1 fault per page with no pre-fault / fault-around.

    Fortunately Skylake has two page-walk units so it can be doing two speculative page-walks in parallel. Also, all the data access is hitting in L1d so page-tables will stay hot in at least L2, speeding up page walks.

    # using array
    # MADV_HUGEPAGE,  read-only init
    taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 1 1000
    result: 0
    time: 5947.741408
    
     Performance counter stats for './touchpage-array-argc.clang 1 1000':
    
              5,951.40 msec task-clock:u              #    1.000 CPUs utilized          
                     9      context-switches          #    0.002 K/sec                  
                     0      cpu-migrations            #    0.000 K/sec                  
                   687      page-faults               #    0.115 K/sec                  
        24,377,094,416      cycles                    #    4.096 GHz                    
        14,397,054,228      instructions              #    0.59  insn per cycle         
         2,183,878,846      mem_load_retired.l2_hit:u #  366.952 M/sec                  
           313,684,419      mem_load_retired.l1_hit:u #   52.708 M/sec                  
                13,218      mem_inst_retired.stlb_miss_loads:u #    0.002 M/sec                  
    
           5.951530513 seconds time elapsed
    
           5.944087000 seconds user
           0.003284000 seconds sys
    

    Notice ~1/10th the TLB misses, and that out of the same ~12G memory loads, only 2G of them hit in L2, probably thanks to successful HW prefetch. (The rest did hit in L3, though.) And we only had 687 page faults; a combination of fault-around and hugepages made this much more efficient.

    And note that the time taken is 3x higher because of the bottleneck on L3 bandwidth.


    Write-init of the array gives us the worst of both worlds:

    # using array
    # MADV_HUGEPAGE (no apparent effect on BSS)  and write-init
    
    taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 3 1000
    result: 0
    time: 16510.222762
    
     Performance counter stats for './touchpage-array-argc.clang 3 1000':
    
             17,143.35 msec task-clock:u              #    1.000 CPUs utilized          
                   341      context-switches          #    0.020 K/sec                  
                     0      cpu-migrations            #    0.000 K/sec                  
                95,218      page-faults               #    0.006 M/sec                  
        70,475,978,274      cycles                    #    4.111 GHz                    
        17,989,948,598      instructions              #    0.26  insn per cycle         
           634,015,284      mem_load_retired.l2_hit:u #   36.983 M/sec                  
           107,041,744      mem_load_retired.l1_hit:u #    6.244 M/sec                  
            37,715,860      mem_inst_retired.stlb_miss_loads:u #    2.200 M/sec                  
    
          17.147615898 seconds time elapsed
    
          16.494211000 seconds user
           0.625193000 seconds sys
    

    Lots of page faults. Also far more TLB misses.

    std::vector version is basically the same as array:

    strace shows that madvise didn't work because I didn't align the pointer. glibc / libstdc++ new tends to return a pointer that is page-aligned + 16, with allocator bookkeeping in those first 16 bytes. For the array, I used alignas(4096) to make sure I could pass it to madvise.

    madvise(0x7f760d133010, 400000000, MADV_HUGEPAGE) = -1 EINVAL (Invalid argument)

    So anyway, with my kernel tuning settings, it only tries to defrag memory for hugepages on madvise, and memory is pretty fragmented ATM. So it didn't end up using any hugepages.
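    A sketch of how the heap pointer could be rounded up to a page boundary before calling madvise (the helper name is made up; this assumes a 4K page size rather than querying sysconf(_SC_PAGESIZE), and the advised range covers slightly less than the full allocation):

```cpp
#include <cstdint>

// Round addr up to the next multiple of page (page must be a power of two).
// madvise() requires a page-aligned start address, so for a heap pointer
// that is page-aligned + 16 we skip the first partial page and would also
// trim the length down to a multiple of the page size.
inline std::uintptr_t page_align_up(std::uintptr_t addr,
                                    std::uintptr_t page = 4096) {
    return (addr + page - 1) & ~(page - 1);
}
```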

    taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-vector-argv.clang 3 1000
    result: 0
    time: 16020.821517
    
     Performance counter stats for './touchpage-vector-argv.clang 3 1000':
    
             16,159.19 msec task-clock:u              #    1.000 CPUs utilized          
                    17      context-switches          #    0.001 K/sec                  
                     0      cpu-migrations            #    0.000 K/sec                  
                97,771      page-faults               #    0.006 M/sec                  
        66,146,780,261      cycles                    #    4.093 GHz                    
        15,294,999,994      instructions              #    0.23  insn per cycle         
           217,426,277      mem_load_retired.l2_hit:u #   13.455 M/sec                  
           842,878,166      mem_load_retired.l1_hit:u #   52.161 M/sec                  
             1,788,935      mem_inst_retired.stlb_miss_loads:u #    0.111 M/sec                  
    
          16.160982779 seconds time elapsed
    
          16.017206000 seconds user
           0.119618000 seconds sys
    

    I'm not sure why TLB misses are so much higher than for the THP read-only test. Maybe contention for memory access and/or eviction of cached page tables by touching more memory ends up slowing down pagewalks so TLB-prefetch doesn't keep up.

    Out of the ~12G loads, HW prefetching was able to make about 1G of them hit in L1d or L2 cache.
