Linux/perl mmap performance

面向向阳花 2021-02-14 10:46

I'm trying to optimize handling of large datasets using mmap. A dataset is in the gigabyte range. The idea was to mmap the whole file into memory, allowing multiple processes t

9 Answers
  • 2021-02-14 11:27

    On 32-bit systems the address space available for mmap() is rather limited (and varies from OS to OS). Be aware of that if you're using multi-gigabyte files and you are only testing on a 64-bit system. (I would have preferred to write this in a comment, but I don't have enough reputation points yet.)

  • 2021-02-14 11:36

    That does sound surprising. Why not try a pure C version?

    Or try your code on a different OS/perl version.

  • 2021-02-14 11:40

    Your access to that file had better be truly random to justify a full mmap. If your usage isn't evenly distributed, you're probably better off with a seek, a read into a freshly malloced area, processing that, and a free; rinse and repeat. And work with chunks that are multiples of 4k, say 64k or so.

    I once benchmarked a lot of string pattern-matching algorithms. Mmapping the entire file was slow and pointless. Reading into a static 32k-ish buffer was better, but still not particularly good. Reading into a freshly malloced chunk, processing that and then letting it go allows the kernel to work wonders under the hood. The difference in speed was enormous, but then again pattern matching is very fast complexity-wise, so more emphasis must be put on handling efficiency than is perhaps usually needed.
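The read-process-release loop described above could look roughly like this in Perl. This is a minimal sketch, not the benchmarked code: `process_in_chunks` and its callback interface are hypothetical names, and a fresh lexical buffer per iteration stands in for the malloc/free pair.

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback->($chunk, $offset) for each chunk of $path.
sub process_in_chunks {
    my ($path, $callback, $chunk_size) = @_;
    $chunk_size //= 64 * 1024;    # 64 KiB, a multiple of the 4 KiB page size

    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $offset = 0;
    while (1) {
        my $chunk;                # fresh buffer each iteration, released at end of loop body
        my $got = sysread($fh, $chunk, $chunk_size);
        die "Read error on $path: $!" unless defined $got;
        last if $got == 0;        # end of file
        $callback->($chunk, $offset);
        $offset += $got;
    }
    close($fh);
    return $offset;               # total number of bytes handed to the callback
}
```

A caller would pass a code reference, for example `process_in_chunks('bigfile.bin', sub { my ($chunk, $offset) = @_; ... })`. Note that a pattern spanning a chunk boundary would be missed by this simple version.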

  • 2021-02-14 11:40

    If you have a relatively recent version of Perl, you shouldn't be using Sys::Mmap. You should be using PerlIO's mmap layer.

    Can you post the code you are using?
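For reference, using the PerlIO mmap layer only requires naming it in the open mode; after that the handle is used with the ordinary seek and read calls. A minimal sketch, where `read_at` is a hypothetical helper name and not part of any module:

```perl
use strict;
use warnings;

# Hypothetical helper: read $length bytes at byte $offset through the :mmap layer.
sub read_at {
    my ($path, $offset, $length) = @_;
    open(my $fh, '<:mmap', $path) or die "Cannot open $path: $!";
    seek($fh, $offset, 0) or die "Cannot seek in $path: $!";    # 0 = SEEK_SET
    my $got = read($fh, my $buf, $length);
    die "Read error on $path: $!" unless defined $got;
    close($fh);
    return $buf;
}
```

For example, `read_at('bigfile.bin', 4096, 16)` would return 16 bytes starting at offset 4096, or fewer if the file ends earlier.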

  • 2021-02-14 11:44

    OK, here's another update. Using Sys::Mmap or PerlIO's ":mmap" layer both work fine in perl, but only for files up to 2 GB (the magic 32-bit limit). Once the file is larger than 2 GB, the following problems appear:

    Using Sys::Mmap and substr for accessing the file, it seems that substr only accepts a 32-bit int for the position parameter, even on systems where perl supports 64-bit integers. There's at least one bug posted about it:

    #62646: Maximum string length with substr

    Using open(my $fh, "<:mmap", "bigfile.bin"), once the file is larger than 2 GB, perl seems to either hang or insist on reading the whole file on the first read (I'm not sure which; I never ran it long enough to see whether it completed), which makes performance dead slow.

    I haven't found a workaround for either of these, so I'm currently stuck with slow (non-mmapped) file operations for working on these files. Unless I find one, I may have to implement the processing in C or another language that handles mmapping huge files better.

  • 2021-02-14 11:44

    If I may plug my own module: I'd advise using File::Map instead of Sys::Mmap. It's much easier to use, and is less crash-prone than Sys::Mmap.
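As a rough illustration of what that looks like, File::Map's `map_file` ties a file to an ordinary scalar, which can then be read with substr. This is a minimal sketch that assumes File::Map is installed from CPAN; `mapped_slice` is a hypothetical helper name.

```perl
use strict;
use warnings;
use File::Map qw(map_file);    # CPAN module, not part of core Perl

# Hypothetical helper: return $length bytes at byte $offset of a mapped file.
sub mapped_slice {
    my ($path, $offset, $length) = @_;
    map_file my $map, $path, '<';    # read-only mapping of the whole file
    return substr($map, $offset, $length);    # copy made before $map goes out of scope
}
```

The mapping is released when `$map` goes out of scope, so a long-running process would normally keep the mapped scalar alive rather than remapping on every call.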
