I'm working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable length records. I've got a first
mmap is way faster. You might write a simple benchmark to prove it to yourself:
#include <fstream>

char data[0x1000];
std::ifstream in("file.bin", std::ios::binary);
while (in)
{
    in.read(data, sizeof(data));
    // do something with the in.gcount() bytes just read
}
versus:
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

const size_t page_size = 0x1000;
int fd = open("filename.bin", O_RDONLY);
struct stat st;
fstat(fd, &st);                  // instead of hard-coding the file size
const off_t file_size = st.st_size;
off_t off = 0;
void *data;
while (off < file_size)
{
    // note: the flags argument must be MAP_PRIVATE or MAP_SHARED, not 0
    data = mmap(NULL, page_size, PROT_READ, MAP_PRIVATE, fd, off);
    // do stuff with data
    munmap(data, page_size);
    off += page_size;
}
Clearly, I'm leaving out details (like how to determine when you reach the end of the file in the event that your file isn't a multiple of page_size, for instance), but it really shouldn't be much more complicated than this.
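For what it's worth, handling that tail is mostly a matter of clamping the last mapping to whatever remains of the file. A minimal sketch, reusing the variables from the loop above (std::min needs <algorithm>):

while (off < file_size)
{
    // map at most page_size bytes, but never past the end of the file
    size_t len = (size_t)std::min<off_t>(page_size, file_size - off);
    data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
    // do stuff with the len valid bytes at data
    munmap(data, len);
    off += len;
}

This works because mmap only requires the offset to be page-aligned, not the length.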
If you can, you might try to break up your data into multiple files that can be mmap()-ed in whole instead of in part (much simpler).
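For the record, mapping a whole file is only a couple of calls. Something along these lines (error handling omitted; chunk0.bin is just a stand-in name):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int fd = open("chunk0.bin", O_RDONLY);   // hypothetical per-chunk file
struct stat st;
fstat(fd, &st);                          // size of this chunk
void *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
// the whole file is now addressable as one contiguous buffer
munmap(data, st.st_size);
close(fd);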
A couple of months ago I had a half-baked implementation of a sliding-window mmap()-ed stream class for boost_iostreams, but nobody cared and I got busy with other stuff. Most unfortunately, I deleted an archive of old unfinished projects a few weeks ago, and that was one of the victims :-(
Update: I should also add the caveat that this benchmark would look quite different in Windows because Microsoft implemented a nifty file cache that does most of what you would do with mmap in the first place. I.e., for frequently-accessed files, you could just do std::ifstream.read() and it would be as fast as mmap, because the file cache would have already done a memory-mapping for you, and it's transparent.
Final Update: Look, people: across a lot of different platform combinations of OS and standard libraries and disks and memory hierarchies, I can't say for certain that the system call mmap, viewed as a black box, will always be substantially faster than read. That wasn't exactly my intent, even if my words could be construed that way. Ultimately, my point was that memory-mapped i/o is generally faster than byte-based i/o; this is still true. If you find experimentally that there's no difference between the two, then the only explanation that seems reasonable to me is that your platform implements memory-mapping under the covers in a way that is advantageous to the performance of calls to read. The only way to be absolutely certain that you're using memory-mapped i/o in a portable way is to use mmap. If you don't care about portability and you can rely on the particular characteristics of your target platforms, then using read may be suitable without measurably sacrificing any performance.
Edit to clean up answer list: @jbl:
the sliding window mmap sounds interesting. Can you say a little more about it?
Sure - I was writing a C++ library for Git (a libgit++, if you will), and I ran into a similar problem to this: I needed to be able to open large (very large) files and not have performance be a total dog (as it would be with std::fstream).
Boost::Iostreams already has a mapped_file Source, but the problem was that it was mmapping whole files, which limits you to 2^wordsize bytes of address space. On 32-bit machines, 4GB isn't big enough. It's not unreasonable to expect to have .pack files in Git that become much larger than that, so I needed to read the file in chunks without resorting to regular file i/o. Under the covers of Boost::Iostreams, I implemented a Source, which is more or less another view of the interaction between std::streambuf and std::istream. You could also try a similar approach by deriving a mapped_filebuf from std::filebuf and, similarly, deriving a mapped_fstream from std::fstream. It's the interaction between the two that's difficult to get right. Boost::Iostreams has some of the work done for you, and it also provides hooks for filters and chains, so I thought it would be more useful to implement it that way.
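In case it helps anyone picture it, here's a rough sketch of what such a sliding-window Source can look like. This is a reconstruction from memory, not the lost code: the name mmap_source and the 1MB default window are mine, and error handling, cleanup, and the copy semantics Boost.Iostreams expects of its devices are glossed over. All a Source has to provide is a read(char*, streamsize) method and the right category typedef:

#include <algorithm>
#include <boost/iostreams/categories.hpp>
#include <boost/iostreams/stream.hpp>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

class mmap_source {
public:
    typedef char char_type;
    typedef boost::iostreams::source_tag category;

    // window_size must be a multiple of the page size so offsets stay aligned
    explicit mmap_source(const char* path, size_t window_size = 1 << 20)
        : fd_(open(path, O_RDONLY)), pos_(0), window_off_(0),
          window_(NULL), window_len_(0), window_size_(window_size)
    {
        struct stat st;
        fstat(fd_, &st);
        file_size_ = st.st_size;
    }

    std::streamsize read(char* s, std::streamsize n)
    {
        if (pos_ >= file_size_)
            return -1;  // -1 signals end-of-sequence in the Source concept
        if (window_ == NULL || pos_ < window_off_ ||
            pos_ >= window_off_ + (off_t)window_len_)
            slide_to(pos_);  // remap the window over the current position
        size_t avail = (size_t)(window_off_ + (off_t)window_len_ - pos_);
        size_t count = std::min<size_t>(avail, (size_t)n);
        std::memcpy(s, window_ + (pos_ - window_off_), count);
        pos_ += (off_t)count;
        return (std::streamsize)count;
    }

private:
    void slide_to(off_t pos)
    {
        if (window_ != NULL)
            munmap(window_, window_len_);
        window_off_ = pos - pos % (off_t)window_size_;  // page-aligned offset
        window_len_ = (size_t)std::min<off_t>((off_t)window_size_,
                                              file_size_ - window_off_);
        window_ = (char*)mmap(NULL, window_len_, PROT_READ, MAP_PRIVATE,
                              fd_, window_off_);
    }

    int fd_;
    off_t file_size_, pos_, window_off_;
    char* window_;
    size_t window_len_, window_size_;
};

Wrapped as boost::iostreams::stream<mmap_source> in(mmap_source("big.pack")), it presents an ordinary std::istream interface, so the rest of the code never has to know the file is being remapped underneath it.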