I face the challenge of reading/writing files (gigabytes in size) line by line.
Across many forum entries and sites (including a bunch of SO posts), mmap was suggested as the fastest option.
Whoever told you to use mmap does not know very much about modern machines. The performance advantages of mmap are a total myth. In the words of Linus Torvalds:
Yes, memory is "slow", but dammit, so is mmap().
The problem with mmap is that every time you touch a page in the mapped region for the first time, it traps into the kernel and actually maps the page into your address space, playing havoc with the TLB.
Try a simple benchmark reading a big file 8K at a time using read and then again with mmap. (Using the same 8K buffer over and over.) You will almost certainly find that read is actually faster.
Your problem was never with getting data out of the kernel; it was with how you handle the data after that. Minimize the work you are doing character-at-a-time; just scan to find the newline and then do a single operation on the block. Personally, I would go back to the read implementation, using (and re-using) a buffer that fits in the L1 cache (8K or so). Or at least, I would try a simple read vs. mmap benchmark to see which is actually faster on your platform.
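A rough sketch of such a benchmark, assuming a POSIX system ("big.dat" is a placeholder path; for a fair comparison, each pass should really run against a cold page cache, e.g. in separate runs):

// Sketch: time sequential read() with one reused 8K buffer against a
// sequential scan of an mmap()ed region. Minimal error handling.
#include <chrono>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static volatile unsigned long sink;   // keep the scans from being optimized away

int main() {
    const char* path = "big.dat";     // placeholder file
    const size_t BUF = 8 * 1024;      // fits comfortably in L1

    // Pass 1: read() into one reused buffer, touching every byte.
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 1;
    char buf[BUF];
    auto t0 = std::chrono::steady_clock::now();
    unsigned long sum = 0;
    ssize_t n;
    while ((n = read(fd, buf, BUF)) > 0)
        for (ssize_t i = 0; i < n; ++i) sum += (unsigned char)buf[i];
    auto t1 = std::chrono::steady_clock::now();
    sink = sum;

    // Pass 2: mmap() the whole file and touch every byte once.
    // (Note: pass 1 has warmed the page cache, which flatters mmap here.)
    struct stat st;
    if (fstat(fd, &st) < 0) return 1;
    char* map = (char*)mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) return 1;
    sum = 0;
    for (off_t i = 0; i < st.st_size; ++i) sum += (unsigned char)map[i];
    auto t2 = std::chrono::steady_clock::now();
    sink = sum;

    munmap(map, st.st_size);
    close(fd);
    using ms = std::chrono::milliseconds;
    printf("read: %lld ms, mmap: %lld ms\n",
           (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
           (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
    return 0;
}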
[Update]
I found a couple more sets of commentary from Mr. Torvalds:
http://lkml.iu.edu/hypermail/linux/kernel/0004.0/0728.html
http://lkml.iu.edu/hypermail/linux/kernel/0004.0/0775.html
The summary:
And on top of that you still have the actual CPU TLB miss costs etc. Which can often be avoided if you just re-read into the same area instead of being excessively clever with memory management just to avoid a copy.
memcpy() (ie "read()" in this case) is always going to be faster in many cases, just because it avoids all the extra complexity. While mmap() is going to be faster in other cases.
In my experience, reading and processing a large file sequentially is one of the "many cases" where using (and re-using) a modest-sized buffer with read/write performs significantly better than mmap.
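As an illustration of that pattern, here is a hedged sketch of read-based line processing (process_line is a hypothetical stand-in for the real per-line work; lines that straddle a chunk boundary are stitched together in a small carry string):

// Sketch: sequential line processing with one reused 8K buffer.
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <string>
#include <unistd.h>

static long g_lines = 0;
// Hypothetical stand-in for whatever per-line work you need to do.
static void process_line(const char* begin, const char* end) {
    (void)begin; (void)end;
    ++g_lines;
}

static void for_each_line(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;
    char buf[8 * 1024];      // reused for every chunk; sized to fit in L1
    std::string carry;       // holds a partial line between chunks
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        const char* p = buf;
        const char* end = buf + n;
        while (const char* nl = (const char*)memchr(p, '\n', end - p)) {
            if (!carry.empty()) {    // finish a line begun in the previous chunk
                carry.append(p, nl);
                process_line(carry.data(), carry.data() + carry.size());
                carry.clear();
            } else {
                process_line(p, nl); // whole line sits inside the buffer
            }
            p = nl + 1;
        }
        carry.append(p, end);        // stash the trailing partial line
    }
    if (!carry.empty())              // the file may not end with '\n'
        process_line(carry.data(), carry.data() + carry.size());
    close(fd);
}

int main(int argc, char** argv) {
    if (argc > 1) for_each_line(argv[1]);
    printf("%ld lines\n", g_lines);
    return 0;
}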
You're using stringstreams to store the lines you identify. This is not comparable with the getline implementation: the stringstream itself adds overhead. As others have suggested, you can store the beginning of the string as a char*, and perhaps the length of the line (or a pointer to the end of the line). The body of the read would be something like:
char* str_start = map;                 // start of the current line
char* str_end;                         // one past the last character of the line
for (long i = 0; i < FILESIZE; ++i) {  // i < FILESIZE: map[FILESIZE] would be out of bounds
    if (map[i] == '\n') {
        str_end = map + i;
        // C style tokenizing of the string str_start to str_end.
        // If you want, you can build a std::string like:
        //   std::string line(str_start, str_end);
        // but note that this implies a memory copy.
        str_start = map + i + 1;
    }
}
Note also that this is much more efficient because you don't do any per-character work (in your version you were appending each character to the stringstream).
You can use memchr to find line endings. It will be much faster than adding to a stringstream one character at a time.
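For example, the per-character scan in the answer above could be replaced with memchr (a sketch, reusing map and FILESIZE from that snippet; memchr comes from <cstring>):

// Sketch: same line splitting as the earlier loop, but memchr finds each '\n'.
// Assumes map points at FILESIZE readable bytes, as in the previous snippet.
char* str_start = map;
char* const map_end = map + FILESIZE;
while (str_start < map_end) {
    char* str_end = (char*)memchr(str_start, '\n', map_end - str_start);
    if (!str_end)
        str_end = map_end;   // the final line may lack a trailing '\n'
    // tokenize [str_start, str_end) here
    str_start = str_end + 1;
}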
The real power of mmap is being able to freely seek in a file, use its contents directly as a pointer, and avoid the overhead of copying data from kernel cache memory to userspace. However, your code sample is not taking advantage of this.
In your loop, you scan the buffer one character at a time, appending to a stringstream. The stringstream doesn't know how long the string is, and so has to reallocate several times in the process. At this point you've killed off any performance increase from using mmap - even the standard getline implementation avoids multiple reallocations (by using a 128-byte on-stack buffer, in the GNU C++ implementation).
If you want to use mmap to its fullest power, use strnchr or memchr to find newlines; these make use of hand-rolled assembler and other optimizations to run faster than most open-coded search loops.
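Putting that together, a minimal self-contained sketch of this approach (assuming POSIX mmap; "input.txt" is a placeholder path, and the madvise hint is optional):

// Sketch: mmap the whole file and split it into lines with memchr,
// without copying any line data out of the mapping.
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("input.txt", O_RDONLY);   // placeholder path
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) return 1;

    char* map = (char*)mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) return 1;
    madvise(map, st.st_size, MADV_SEQUENTIAL);  // hint: front-to-back access

    const char* p = map;
    const char* end = map + st.st_size;
    long lines = 0;
    while (p < end) {
        const char* nl = (const char*)memchr(p, '\n', end - p);
        if (!nl) nl = end;                  // last line may lack '\n'
        // [p, nl) is one line, usable in place with no copy
        ++lines;
        p = nl + 1;
    }
    printf("%ld lines\n", lines);

    munmap(map, st.st_size);
    close(fd);
    return 0;
}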