Why is iterating through `std::vector` faster than iterating through `std::array`?

攒了一身酷 · asked 2020-12-04 00:56

I recently asked this question: Why is iterating an std::array much faster than iterating an std::vector?

As people quickly pointed out, my benchmark had many flaws.

2 Answers
  • 2020-12-04 01:34

    The difference is due to the array's memory pages not being resident in the process address space (the global-scope array is stored in the .bss section of the executable, which hasn't been paged in; it is zero-initialized). The vector, by contrast, has just been allocated and zero-filled, so its memory pages are already present.

    If you add

    std::fill_n(v.data(), n, 1); // included in <algorithm>
    

    as the first line of main to bring the pages in (pre-fault), that makes array time the same as that of vector.


    On Linux, instead of that, you can do mlock(v.data(), v.size() * sizeof(v[0])); to bring the pages into the address space. See man mlock for full details.

  • 2020-12-04 01:37

    Memory mapping/allocating is lazy: the first access to a page will cause a page fault exception (#PF on x86). This includes the BSS, as well as file-backed mappings like the text segment of your executable. These page faults are "valid" so they don't result in a SIGSEGV being delivered; instead the kernel allocates a physical page if necessary and wires up the hardware page tables so the load or store can rerun and not fault the 2nd time.

    This is expensive, especially if the kernel doesn't "fault-around" and prepare multiple pages during one page fault. (Especially with Spectre + Meltdown mitigation enabled making user<->kernel round trips more expensive on current x86-64 hardware.)

    You're letting std::vector's constructor write zeros to the array after dynamic allocation¹, so std::vector does all its page-faulting outside your timed loop. This happens before main, while the implementation is running constructors for static objects.

    But the array is zero-initialized so it gets placed in the BSS. The first thing to touch it is your loop. Your array<> loop pays for all the page faults inside the timed region.

    If you used new int[n] to dynamically allocate but not initialize a block of memory, you'd see the same behaviour as from your static array<>. (Maybe slightly better if Linux is more willing to use transparent hugepages for a dynamic allocation instead of the BSS mapping.)
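    The distinction between the two spellings of `new` can be sketched as follows (the helper names are made up for illustration):

```cpp
#include <cstddef>

// Default-initialized: int elements are left uninitialized, so no page is
// touched until your own loop reads or writes them -- the soft page faults
// land inside the timed region, like the static array<> case.
int* alloc_lazy(std::size_t n) { return new int[n]; }

// Value-initialized: the zero-fill touches every page up front, paying the
// page faults before the benchmark loop starts, like std::vector<int>(n).
int* alloc_prefaulted(std::size_t n) { return new int[n](); }
```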



    Footnote 1 std::vector in libstdc++ and libc++ is too stupid to take advantage of getting already-zeroed pages from the OS, like it could if it used calloc or equivalent. It would be possible if the library provided a new/delete-compatible allocator for zeroed memory.

    C++ new/delete is crippled vs. malloc/free/calloc/realloc. I have no idea why ISO C++ left out calloc and realloc: both are very useful for large allocations, especially realloc for resizing a std::vector of trivially-copyable objects that might have room to grow its mapping without copying. But since new/delete aren't guaranteed to be compatible with malloc/free, and new is replaceable, libraries can't very easily use calloc and realloc even under the hood.
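    The allocation side of such a zeroed-memory allocator could look roughly like this (a hypothetical sketch, not a real library facility; note that `std::vector` would still value-initialize the elements unless the library also knew it could skip the zero-fill for trivially-constructible types):

```cpp
#include <cstdlib>
#include <new>

// Hypothetical allocator that gets already-zeroed pages from the OS via
// calloc, so a vector using it would not need to write the zeros itself.
template <class T>
struct zeroed_alloc {
    using value_type = T;
    zeroed_alloc() = default;
    template <class U> zeroed_alloc(const zeroed_alloc<U>&) {}

    T* allocate(std::size_t n) {
        // calloc checks n * sizeof(T) for overflow and returns zeroed memory.
        if (void* p = std::calloc(n, sizeof(T)))
            return static_cast<T*>(p);
        throw std::bad_alloc();
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
};
template <class T, class U>
bool operator==(const zeroed_alloc<T>&, const zeroed_alloc<U>&) { return true; }
template <class T, class U>
bool operator!=(const zeroed_alloc<T>&, const zeroed_alloc<U>&) { return false; }
```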


    Another factor: read-only leaves pages CoW mapped to the same physical zero page

    When lazy allocation is triggered by a read (instead of write), it reads as zero. (BSS pages read as zero, new pages from mmap(MAP_ANONYMOUS) read as all-zero.)

    The (soft) page fault handler that wired up the HW page table didn't need to actually allocate a physical page aka page-frame to back that virtual page. Instead, Linux maps clean (unwritten) anonymous pages to a single physical zeroed page. (This applies across all tasks.)

    If we make multiple passes over the array, this leads to the curious situation where we can get TLB misses but L1d or L3 hits (depending on hugepage or not) because we have multiple virtual pages pointing to the same physical location.

    (Some CPUs, e.g. AMD Ryzen, use micro-tagging in the L1d cache to save power, at the cost of the cache only being able to hit for one virtual address even if the same memory is mapped to multiple virtual addresses. Intel CPUs use true VIPT L1d caches and really can get this effect.)

    I made a test program for Linux that will use madvise(MADV_HUGEPAGE) (to encourage the kernel to defrag memory for hugepages) or madvise(MADV_NOHUGEPAGE) (to disable hugepages even for the read-only case).

    For some reason Linux BSS pages don't use hugepages when you write them; only reading them uses 2M hugepages (the single 2M zero page backing them is too big for L1d or L2, but fits in L3, and we get essentially all TLB hits). It's hard to see this in /proc/PID/smaps because unwritten memory doesn't show up as "resident" at all. (Remember it's physically backed by a system-wide shared page of zeroes.)

    I made some changes to your benchmark code to rerun the sum loop multiple times after an init pass that either reads or writes the array, according to command-line args. The repeat-loop makes it run longer so we can get more precise timing, and to amortize the init so we get useful results from perf.

    #include <vector>
    #include <array>
    #include <stdio.h>
    #include <chrono>
    #include <sys/mman.h>
    
    using namespace std;
    
    constexpr int n = 100'000'000;
    //vector<int> v(n);
    alignas(4096) array<int, n> v;
    
    //template<class T>
    __attribute__((noinline))
    int toucharray(volatile int *vv, int write_init) {
            int res=vv[0];
            for(int i=32 ; i<n ; i+=128)
                    if(write_init)
                        vv[i] = 0;
                    else
                        res += vv[i];
    //      volatile int sum = res;  // noinline is fine, we don't need to stop multiple calls from CSEing
            return res;
    }
    
    template <class T>
    __attribute__((noinline,noclone))
    int sum_container(T &vv) {
        unsigned int res=0;
        for(int x : vv)
            res += x;
        __attribute__((used)) static volatile int sink;
        sink = res;  // a side-effect stops IPA from deciding that this is a pure function
        return res;
    }
    
    int main(int argc, char**argv)
    {
        int write_init = 0;
        int hugepage = 0;
        if (argc>1) {
                hugepage = argv[1][0] & 1;
                write_init = argv[1][0] & 2;
        }
        int repcount = 1000;
        if (argc>2)
                repcount = atoi(argv[2]);
    
    // TODO: option for no madvise.
        madvise(v.data(), n*sizeof(v[0]), MADV_SEQUENTIAL);
        madvise(v.data(), n*sizeof(v[0]), hugepage ? MADV_HUGEPAGE : MADV_NOHUGEPAGE);  
        madvise(v.data(), n*sizeof(v[0]), MADV_WILLNEED); 
     // SEQ and WILLNEED probably only matter for file-backed mappings to reduce hard page faults.
     //  Probably not encouraging faultahead / around for lazy-allocation soft page fault
    
        toucharray(v.data(), write_init);
    
        int res = 0;
        auto start = chrono::steady_clock::now();
        for(int i=0; i<repcount ; i++)
            res = sum_container(v);
        auto end = chrono::steady_clock::now();
        double elapsed =
            std::chrono::duration_cast<
                std::chrono::duration<double, std::milli>
            >(end - start).count();
        printf("result: %d\ntime: %f\n", res, elapsed);
    }
    

    best case: clang++ -O3 -march=native (skylake) actually unrolls with multiple accumulators, unlike gcc -funroll-loops which does a silly job.

    On my Skylake i7-6700k with DDR4-2666 DRAM, configured for 4.2GHz max turbo and governor=performance -

    # using std::array<int,n>
    # 0&1 = 0 -> MADV_NOHUGEPAGE.  0&2 = 0 -> read-only init
    taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 0 1000
    result: 0
    time: 1961.952394
    
     Performance counter stats for './touchpage-array-madv-nohuge-argc.clang 0 1000':
    
              2,017.34 msec task-clock:u              #    1.000 CPUs utilized          
                    50      context-switches          #    0.025 K/sec                  
                     0      cpu-migrations            #    0.000 K/sec                  
                97,774      page-faults               #    0.048 M/sec                  
         8,287,680,837      cycles                    #    4.108 GHz                    
        14,500,762,859      instructions              #    1.75  insn per cycle         
                13,688      mem_load_retired.l2_hit:u #    0.007 M/sec                  
        12,501,329,912      mem_load_retired.l1_hit:u # 6196.927 M/sec                  
               144,559      mem_inst_retired.stlb_miss_loads:u #    0.072 M/sec                  
    
           2.017765632 seconds time elapsed
    
           1.979410000 seconds user
           0.036659000 seconds sys
    

    Notice considerable TLB misses (mem_inst_retired.stlb_miss_loads:u counts 2nd-level TLB misses in user-space). And 97k page faults. That's pretty much exactly as many 4k pages as it takes to cover the 100M * 4 = 400MB array, so we got 1 fault per page with no pre-fault / fault-around.

    Fortunately Skylake has two page-walk units so it can be doing two speculative page-walks in parallel. Also, all the data access is hitting in L1d so page-tables will stay hot in at least L2, speeding up page walks.

    # using array
    # MADV_HUGEPAGE,  read-only init
    taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 1 1000
    result: 0
    time: 5947.741408
    
     Performance counter stats for './touchpage-array-argc.clang 1 1000':
    
              5,951.40 msec task-clock:u              #    1.000 CPUs utilized          
                     9      context-switches          #    0.002 K/sec                  
                     0      cpu-migrations            #    0.000 K/sec                  
                   687      page-faults               #    0.115 K/sec                  
        24,377,094,416      cycles                    #    4.096 GHz                    
        14,397,054,228      instructions              #    0.59  insn per cycle         
         2,183,878,846      mem_load_retired.l2_hit:u #  366.952 M/sec                  
           313,684,419      mem_load_retired.l1_hit:u #   52.708 M/sec                  
                13,218      mem_inst_retired.stlb_miss_loads:u #    0.002 M/sec                  
    
           5.951530513 seconds time elapsed
    
           5.944087000 seconds user
           0.003284000 seconds sys
    

    Notice ~1/10th the TLB misses, and that out of the same ~12G memory loads, only 2G of them hit in L2, probably thanks to successful HW prefetch. (The rest did hit in L3, though.) And we only had 687 page faults; a combination of fault-around and hugepages made this much more efficient.

    And note that the time taken is 3x higher because of the bottleneck on L3 bandwidth.


    Write-init of the array gives us the worst of both worlds:

    # using array
    # MADV_HUGEPAGE (no apparent effect on BSS)  and write-init
    
    taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 3 1000
    result: 0
    time: 16510.222762
    
     Performance counter stats for './touchpage-array-argc.clang 3 1000':
    
             17,143.35 msec task-clock:u              #    1.000 CPUs utilized          
                   341      context-switches          #    0.020 K/sec                  
                     0      cpu-migrations            #    0.000 K/sec                  
                95,218      page-faults               #    0.006 M/sec                  
        70,475,978,274      cycles                    #    4.111 GHz                    
        17,989,948,598      instructions              #    0.26  insn per cycle         
           634,015,284      mem_load_retired.l2_hit:u #   36.983 M/sec                  
           107,041,744      mem_load_retired.l1_hit:u #    6.244 M/sec                  
            37,715,860      mem_inst_retired.stlb_miss_loads:u #    2.200 M/sec                  
    
          17.147615898 seconds time elapsed
    
          16.494211000 seconds user
           0.625193000 seconds sys
    

    Lots of page faults. Also far more TLB misses.

    std::vector version is basically the same as array:

    strace shows that madvise didn't work because I didn't align the pointer. glibc / libstdc++ new tends to return a pointer that is page-aligned + 16, with allocator bookkeeping in those first 16 bytes. For the array, I used alignas(4096) to make sure I could pass it to madvise.

    madvise(0x7f760d133010, 400000000, MADV_HUGEPAGE) = -1 EINVAL (Invalid argument)

    So anyway, with my kernel tuning settings, it only tries to defrag memory for hugepages on madvise, and memory is pretty fragmented ATM. So it didn't end up using any hugepages.
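    A sketch of how the heap pointer could be rounded up to a page boundary before calling madvise (the helper name is made up; this assumes a 4K page size rather than querying sysconf(_SC_PAGESIZE), and the advised range covers slightly less than the full allocation):

```cpp
#include <cstdint>

// Round addr up to the next multiple of page (page must be a power of two).
// madvise() requires a page-aligned start address, so for a heap pointer
// that is page-aligned + 16 we skip the first partial page and would also
// trim the length down to a multiple of the page size.
inline std::uintptr_t page_align_up(std::uintptr_t addr,
                                    std::uintptr_t page = 4096) {
    return (addr + page - 1) & ~(page - 1);
}
```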

    taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-vector-argv.clang 3 1000
    result: 0
    time: 16020.821517
    
     Performance counter stats for './touchpage-vector-argv.clang 3 1000':
    
             16,159.19 msec task-clock:u              #    1.000 CPUs utilized          
                    17      context-switches          #    0.001 K/sec                  
                     0      cpu-migrations            #    0.000 K/sec                  
                97,771      page-faults               #    0.006 M/sec                  
        66,146,780,261      cycles                    #    4.093 GHz                    
        15,294,999,994      instructions              #    0.23  insn per cycle         
           217,426,277      mem_load_retired.l2_hit:u #   13.455 M/sec                  
           842,878,166      mem_load_retired.l1_hit:u #   52.161 M/sec                  
             1,788,935      mem_inst_retired.stlb_miss_loads:u #    0.111 M/sec                  
    
          16.160982779 seconds time elapsed
    
          16.017206000 seconds user
           0.119618000 seconds sys
    

    I'm not sure why TLB misses are so much higher than for the THP read-only test. Maybe contention for memory access and/or eviction of cached page tables by touching more memory ends up slowing down pagewalks so TLB-prefetch doesn't keep up.

    Out of the ~12G loads, HW prefetching was able to make about 1G of them hit in L1d or L2 cache.
