mmap() vs. reading blocks

醉酒成梦 2020-11-22 16:59

I'm working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable length records. I've got a first implementation up and running and am now looking to improve performance, particularly since the input file gets scanned many times.

12 Answers
  • 2020-11-22 17:18

    There are lots of good answers here already that cover many of the salient points, so I'll just add a couple of issues I didn't see addressed directly above. That is, this answer shouldn't be considered a comprehensive treatment of the pros and cons, but rather an addendum to other answers here.

    mmap seems like magic

    Taking the case where the file is already fully cached1 as the baseline2, mmap might seem pretty much like magic:

    1. mmap only requires 1 system call to (potentially) map the entire file, after which no more system calls are needed.
    2. mmap doesn't require a copy of the file data from kernel to user-space.
    3. mmap allows you to access the file "as memory", including processing it with whatever advanced tricks you can do against memory, such as compiler auto-vectorization, SIMD intrinsics, prefetching, optimized in-memory parsing routines, OpenMP, etc. (sketched just after this list).
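
    For instance (a minimal sketch; the file name, the byte-summing loop and the omitted error handling are illustrations, not from the question), the whole "file as memory" pattern looks like:

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(void) {
            int fd = open("big.dat", O_RDONLY);   /* illustrative name */
            struct stat st;
            fstat(fd, &st);                       /* error checks omitted */

            /* one syscall maps the whole file; no read() calls follow */
            const unsigned char *p =
                mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

            /* the file is now "just memory": any in-memory trick applies */
            unsigned long sum = 0;
            for (off_t i = 0; i < st.st_size; i++)
                sum += p[i];
            printf("checksum: %lu\n", sum);

            munmap((void *)p, st.st_size);
            close(fd);
            return 0;
        }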

    In the case that the file is already in the cache, it seems impossible to beat: you just directly access the kernel page cache as memory and it can't get faster than that.

    Well, it can.

    mmap is not actually magic because...

    mmap still does per-page work

    A primary hidden cost of mmap vs read(2) (which is really the comparable OS-level syscall for reading blocks) is that with mmap you'll need to do "some work" for every 4K page in user-space, even though it might be hidden by the page-fault mechanism.

    For example, a typical implementation that just mmaps the entire file will need to fault in 100 GB / 4K = 25 million pages to read a 100 GB file. Now, these will be minor faults, but 25 million page faults is still not going to be super fast. The cost of a minor fault is probably in the 100s of nanos in the best case.

    mmap relies heavily on TLB performance

    Now, you can pass MAP_POPULATE to mmap to tell it to set up all the page tables before returning, so there should be no page faults while accessing it. This has the small problem that it also reads the entire file into RAM, which is going to blow up if you try to map a 100GB file - but let's ignore that for now3. The kernel still needs to do per-page work to set up these page tables (this shows up as kernel time). This ends up being a major cost of the mmap approach, and it's proportional to the file size (i.e., it doesn't get relatively less important as the file size grows)4.
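
    As a hedged sketch (fd and len are assumed to come from the usual open()/fstat() dance), the flag is just OR-ed into the mmap call:

        #include <sys/mman.h>

        /* Pre-fault every page at map time: one syscall pays the whole
         * page-table cost up front, but the entire range is read into
         * RAM immediately - only sane for ranges that actually fit. */
        static void *map_populated(int fd, size_t len) {
            return mmap(NULL, len, PROT_READ,
                        MAP_PRIVATE | MAP_POPULATE, fd, 0);
        }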

    Finally, even in user-space accessing such a mapping isn't exactly free (compared to large memory buffers not originating from a file-based mmap) - even once the page tables are set up, each access to a new page is going to, conceptually, incur a TLB miss. Since mmaping a file means using the page cache and its 4K pages, you again incur this cost 25 million times for a 100GB file.

    Now, the actual cost of these TLB misses depends heavily on at least the following aspects of your hardware: (a) how many 4K TLB entries you have and how the rest of the translation caching hierarchy performs, (b) how well hardware prefetch deals with the TLB - e.g., can prefetch trigger a page walk? (c) how fast and how parallel the page-walking hardware is. On modern high-end x86 Intel processors, the page-walking hardware is in general very strong: there are at least 2 parallel page walkers, a page walk can occur concurrently with continued execution, and hardware prefetching can trigger a page walk. So the TLB impact on a streaming read load is fairly low - and such a load will often perform similarly regardless of the page size. Other hardware is usually much worse, however!

    read() avoids these pitfalls

    The read() syscall, which is what generally underlies the "block read" type calls offered in, e.g., C, C++ and other languages, has one primary disadvantage that everyone is well aware of:

    • Every read() call of N bytes must copy N bytes from kernel to user space.

    On the other hand, it avoids most of the costs above - you don't need to map 25 million 4K pages into user space. You can usually malloc a single small buffer in user space, and re-use that repeatedly for all your read calls. On the kernel side, there is almost no issue with 4K pages or TLB misses because all of RAM is usually linearly mapped using a few very large pages (e.g., 1 GB pages on x86), so the underlying pages in the page cache are covered very efficiently in kernel space.
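
    A sketch of that reuse pattern (the buffer size and the process() consumer are illustrative assumptions):

        #include <stdlib.h>
        #include <unistd.h>

        void process(const char *block, size_t n);  /* hypothetical consumer */

        void read_all(int fd) {
            enum { BUF_SIZE = 64 * 1024 };  /* small enough to stay cache-hot */
            char *buf = malloc(BUF_SIZE);   /* one buffer, reused every call */
            ssize_t n;
            while ((n = read(fd, buf, BUF_SIZE)) > 0)
                process(buf, (size_t)n);    /* each call copies kernel->user */
            free(buf);
        }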

    So basically you have the following comparison to determine which is faster for a single read of a large file:

    Is the extra per-page work implied by the mmap approach more costly than the per-byte work of copying file contents from kernel to user space implied by using read()?

    On many systems, they are actually approximately balanced. Note that each one scales with completely different attributes of the hardware and OS stack.

    In particular, the mmap approach becomes relatively faster when:

    • The OS has fast minor-fault handling and especially minor-fault bulking optimizations such as fault-around.
    • The OS has a good MAP_POPULATE implementation which can efficiently process large maps in cases where, for example, the underlying pages are contiguous in physical memory.
    • The hardware has strong page translation performance, such as large TLBs, fast second level TLBs, fast and parallel page-walkers, good prefetch interaction with translation and so on.

    ... while the read() approach becomes relatively faster when:

    • The read() syscall has good copy performance. E.g., good copy_to_user performance on the kernel side.
    • The kernel has an efficient (relative to userland) way to map memory, e.g., using only a few large pages with hardware support.
    • The kernel has fast syscalls and a way to keep kernel TLB entries around across syscalls.

    The hardware factors above vary wildly across different platforms, even within the same family (e.g., within x86 generations and especially market segments) and definitely across architectures (e.g., ARM vs x86 vs PPC).

    The OS factors keep changing as well, with various improvements on both sides causing a large jump in the relative speed for one approach or the other. A recent list includes:

    • Addition of fault-around, described above, which really helps the mmap case without MAP_POPULATE.
    • Addition of fast-path copy_to_user methods in arch/x86/lib/copy_user_64.S, e.g., using REP MOVQ when it is fast, which really help the read() case.

    Update after Spectre and Meltdown

    The mitigations for the Spectre and Meltdown vulnerabilities considerably increased the cost of a system call. On the systems I've measured, the cost of a "do nothing" system call (which is an estimate of the pure overhead of the system call, apart from any actual work done by the call) went from about 100 ns on a typical modern Linux system to about 700 ns. Furthermore, depending on your system, the page-table isolation fix specifically for Meltdown can have additional downstream effects apart from the direct system call cost due to the need to reload TLB entries.

    All of this is a relative disadvantage for read() based methods as compared to mmap based methods, since read() methods must make one system call for each "buffer size" worth of data. You can't arbitrarily increase the buffer size to amortize this cost, because using large buffers usually performs worse: you exceed the L1 size and hence constantly suffer cache misses.

    On the other hand, with mmap, you can map in a large region of memory with MAP_POPULATE and then access it efficiently, at the cost of only a single system call.


    1 This more-or-less also includes the case where the file wasn't fully cached to start with, but where the OS read-ahead is good enough to make it appear so (i.e., the page is usually cached by the time you want it). This is a subtle issue though because the way read-ahead works is often quite different between mmap and read calls, and can be further adjusted by "advise" calls as described in 2.

    2 ... because if the file is not cached, your behavior is going to be completely dominated by IO concerns, including how sympathetic your access pattern is to the underlying hardware - and all your effort should be in ensuring such access is as sympathetic as possible, e.g. via use of madvise or fadvise calls (and whatever application level changes you can make to improve access patterns).
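
    For example (a hedged sketch; addr/len denote an existing mapping of the same fd), the sequential-access hints look like:

        #include <fcntl.h>
        #include <sys/mman.h>

        /* Tell the kernel the access pattern is sequential so it can
         * read ahead aggressively and drop already-consumed pages. */
        static void hint_sequential(int fd, void *addr, size_t len) {
            posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL); /* read() path */
            madvise(addr, len, MADV_SEQUENTIAL);            /* mmap() path */
        }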

    3 You could get around that, for example, by sequentially mmaping in windows of a smaller size, say 100 MB.

    4 In fact, it turns out the MAP_POPULATE approach is (at least on some hardware/OS combinations) only slightly faster than not using it, probably because the kernel is using fault-around - so the actual number of minor faults is reduced by a factor of 16 or so.

  • 2020-11-22 17:23

    I'm sorry Ben Collins lost his sliding windows mmap source code. That'd be nice to have in Boost.

    Yes, mapping the file is much faster. You're essentially using the OS virtual memory subsystem to associate memory-to-disk and vice versa. Think about it this way: if the OS kernel developers could make it faster, they would. Because doing so makes just about everything faster: databases, boot times, program load times, et cetera.

    The sliding window approach really isn't that difficult, as multiple contiguous pages can be mapped at once. So the size of the record doesn't matter so long as the largest of any single record will fit into memory. The important thing is managing the book-keeping.

    If a record doesn't begin on a getpagesize() boundary, your mapping has to begin on the previous page. The length of the region mapped extends from the first byte of the record (rounded down if necessary to the nearest multiple of getpagesize()) to the last byte of the record (rounded up to the nearest multiple of getpagesize()). When you're finished processing a record, you can munmap() it, and move on to the next.
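
    A sketch of that bookkeeping (record_off and record_len are whatever your record index supplies; all names are illustrative):

        #include <sys/mman.h>
        #include <unistd.h>

        /* Map just the pages that cover one record. */
        static char *map_record(int fd, off_t record_off, size_t record_len,
                                char **base, size_t *map_len) {
            long page = sysconf(_SC_PAGESIZE);   /* == getpagesize() */
            off_t start = record_off & ~((off_t)page - 1);  /* round down */
            size_t delta = (size_t)(record_off - start);
            *map_len = (delta + record_len + page - 1)
                           & ~((size_t)page - 1);           /* round up */
            *base = mmap(NULL, *map_len, PROT_READ, MAP_PRIVATE, fd, start);
            return *base + delta;   /* first byte of the record */
        }
        /* when done: munmap(base, map_len); then map the next record */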

    This all works just fine under Windows too using CreateFileMapping() and MapViewOfFile() (and GetSystemInfo() to get SYSTEM_INFO.dwAllocationGranularity --- not SYSTEM_INFO.dwPageSize).
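
    A hedged sketch of the Windows version (names and the omitted error handling are illustrative; note the view offset must be a multiple of the allocation granularity, not the page size):

        #include <windows.h>

        static char *map_view(HANDLE hFile, unsigned long long off,
                              SIZE_T len, unsigned long long *viewOff) {
            SYSTEM_INFO si;
            GetSystemInfo(&si);
            unsigned long long gran = si.dwAllocationGranularity;
            *viewOff = off - (off % gran);            /* round down */
            SIZE_T viewLen = (SIZE_T)(off - *viewOff) + len;

            HANDLE hMap = CreateFileMapping(hFile, NULL, PAGE_READONLY,
                                            0, 0, NULL);
            char *base = (char *)MapViewOfFile(hMap, FILE_MAP_READ,
                                               (DWORD)(*viewOff >> 32),
                                               (DWORD)*viewOff, viewLen);
            CloseHandle(hMap);   /* the view keeps the mapping alive */
            return base ? base + (off - *viewOff) : NULL;
        }
        /* unmap later with UnmapViewOfFile(base - (off - viewOff)) */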

  • 2020-11-22 17:24

    I think the greatest thing about mmap is the potential for asynchronous reading with:

        /* fd, pos, size_left, ctx and MAP_FLAGS (e.g. MAP_PRIVATE)
           are set up elsewhere */
        void *addr1 = NULL, *addr2;
        size_t r = 0;
        while (size_left > 0) {
            r = size_left < MMAP_SIZE ? size_left : MMAP_SIZE;
            /* request the next window; the kernel can start filling it
               while the previous one is still being processed */
            addr2 = mmap(NULL, r, PROT_READ, MAP_FLAGS, fd, pos);
            if (addr1 != NULL) {
                /* process the mapping from the previous iteration */
                feed_data(ctx, addr1, MMAP_SIZE);
                munmap(addr1, MMAP_SIZE);
            }
            addr1 = addr2;
            size_left -= r;
            pos += r;
        }
        /* the final window may be shorter than MMAP_SIZE */
        feed_data(ctx, addr1, r);
        munmap(addr1, r);

    Problem is that I can't find the right MAP_FLAGS to give a hint that this memory should be synced from the file ASAP. I hope that MAP_POPULATE gives the right hint for mmap (i.e. it will not try to load all contents before returning from the call, but will do that asynchronously with feed_data). At least it gives better results with this flag, even though the manual states that it does nothing without MAP_PRIVATE since 2.6.23.

  • 2020-11-22 17:26

    mmap should be faster, but I don't know how much. It very much depends on your code. If you use mmap it's best to mmap the whole file at once, which will make your life a lot easier. One potential problem is that if your file is bigger than 4GB (or in practice the limit is lower, often 2GB) you will need a 64-bit architecture. So if you're using a 32-bit environment, you probably don't want to use it.

    Having said that, there may be a better route to improving performance. You said the input file gets scanned many times; if you can read it in one pass and then be done with it, that could potentially be much faster.

  • 2020-11-22 17:29

    This sounds like a good use-case for multi-threading... I'd think you could pretty easily set up one thread to be reading data while the other(s) process it. That may be a way to dramatically increase the perceived performance. Just a thought.
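
    A hedged sketch of that idea in C (double buffering with one reader thread; process(), fd and the buffer size are illustrative assumptions, not a definitive design):

        #include <pthread.h>
        #include <semaphore.h>
        #include <unistd.h>

        void process(const char *block, size_t n);  /* hypothetical consumer */

        enum { BUF_SIZE = 1 << 20 };
        static char buf[2][BUF_SIZE];
        static ssize_t len[2];
        static sem_t filled[2], drained[2];
        static int fd;

        static void *reader(void *unused) {
            (void)unused;
            for (int i = 0;; i ^= 1) {
                sem_wait(&drained[i]);             /* buffer i is free again */
                len[i] = read(fd, buf[i], BUF_SIZE);
                sem_post(&filled[i]);              /* hand buffer i over */
                if (len[i] <= 0) return NULL;      /* EOF/error ends stream */
            }
        }

        static void consume_all(void) {
            pthread_t t;
            for (int i = 0; i < 2; i++) {
                sem_init(&filled[i], 0, 0);
                sem_init(&drained[i], 0, 1);       /* both start empty */
            }
            pthread_create(&t, NULL, reader, NULL);
            for (int i = 0;; i ^= 1) {
                sem_wait(&filled[i]);              /* wait for data */
                if (len[i] <= 0) break;
                process(buf[i], (size_t)len[i]);   /* overlaps the next read */
                sem_post(&drained[i]);
            }
            pthread_join(t, NULL);
        }

    While the consumer is chewing on one buffer, the reader thread is already filling the other, so I/O and processing overlap.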

  • 2020-11-22 17:31

    Perhaps you should pre-process the files, so each record is in a separate file (or at least so that each file is a mmap-able size).

    Also, could you do all of the processing steps for each record before moving on to the next one? Maybe that would avoid some of the IO overhead?
