While trying to use memory mapped files to create a multi-gigabyte file (around 13gb), I ran into what appears to be a problem with mmap(). The initial implementation was done
Edit: Upgrading to "proper answer". The problem is with the way that "dirty pages" are handled by Linux. I still want my system to flush dirty pages now and again, so I didn't allow it to have TOO many outstanding pages. But at the same time, I can show that this is what is going on.
I did this (with "sudo -i"):
# echo 80 > /proc/sys/vm/dirty_ratio
# echo 60 > /proc/sys/vm/dirty_background_ratio
Which gives these VM dirty settings:
grep ^ /proc/sys/vm/dirty*
/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:60
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:80
/proc/sys/vm/dirty_writeback_centisecs:500
This makes my benchmark run like this:
$ ./a.out m64 200000000
Setup Duration 33.1042 seconds
Linux: mmap64
size=1525 MB
Mapping Duration 30.6785 seconds
Overall Duration 91.7038 seconds
Compare with "before":
$ ./a.out m64 200000000
Setup Duration 33.7436 seconds
Linux: mmap64
size=1525
Mapping Duration 1467.49 seconds
Overall Duration 1501.89 seconds
which had these VM dirty settings:
grep ^ /proc/sys/vm/dirty*
/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:20
/proc/sys/vm/dirty_writeback_centisecs:500
I'm not sure exactly what settings I should use to get IDEAL performance whilst still not leaving all dirty pages sitting around in memory forever (the more unwritten data is held in memory, the more is lost if the system crashes, and the longer a final sync to disk takes).
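If you want to experiment with different values, it may help to log the current settings alongside each benchmark run rather than checking them by hand with grep. The following is a minimal sketch, assuming C++17 and a Linux system that exposes the tunables under /proc/sys/vm/; the names ReadTunable and PrintDirtyRatios are made up for illustration and are not part of the original benchmark.

```cpp
#include <fstream>
#include <iostream>
#include <optional>
#include <string>

// Hypothetical helper: reads one integer tunable from a file such as
// /proc/sys/vm/dirty_ratio. Returns std::nullopt if the file cannot be
// opened or does not start with an integer.
std::optional<long> ReadTunable(const std::string& path)
{
    std::ifstream in(path);
    long value = 0;
    if (!(in >> value))
        return std::nullopt;
    return value;
}

// Hypothetical helper: prints the two ratios changed above. The directory
// is a parameter so the function can be pointed at test files.
void PrintDirtyRatios(const std::string& dir = "/proc/sys/vm/")
{
    const char* names[] = {"dirty_background_ratio", "dirty_ratio"};
    for (const char* name : names)
    {
        const std::optional<long> value = ReadTunable(dir + name);
        std::cout << name << ": ";
        if (value)
            std::cout << *value;
        else
            std::cout << "(unavailable)";
        std::cout << '\n';
    }
}
```

Calling PrintDirtyRatios() at the start of the benchmark would record which settings a given timing belongs to. Reading these files does not need root; only writing them does.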
For history: Here's what I originally wrote as a "non-answer" - some comments here still apply...
Not REALLY an answer, but I find it rather interesting that if I change the code to first read the entire array, and then write it out, it's SIGNIFICANTLY faster than doing both in the same loop. I appreciate that this is utterly useless if you need to deal with really huge data sets (bigger than memory). With the original code as posted, the time for 100M uint64 values is 134s. When I split the read and the write cycle, it's 43s.
This is the DoMapping function [only code I've changed] after modification:
// One input record: the value to store and the destination slot for it.
struct VI
{
    uint32_t value;
    uint32_t index;
};

void DoMapping(uint64_t* dest, size_t rowCount)
{
    inputStream->seekg(0, std::ios::beg);
    std::chrono::system_clock::time_point startTime = std::chrono::system_clock::now();
    uint32_t index, value;
    std::vector<VI> data;
    data.reserve(rowCount);  // avoid repeated reallocation while reading

    // Pass 1: read every (index, value) pair from the input into memory.
    for (size_t i = 0; i < rowCount; i++)
    {
        inputStream->read(reinterpret_cast<char*>(&index), static_cast<std::streamsize>(sizeof(uint32_t)));
        inputStream->read(reinterpret_cast<char*>(&value), static_cast<std::streamsize>(sizeof(uint32_t)));
        VI d = {value, index};  // initializer order must match VI: value, then index
        data.push_back(d);
    }

    // Pass 2: scatter the values into the memory-mapped destination.
    for (size_t i = 0; i < rowCount; ++i)
    {
        value = data[i].value;
        index = data[i].index;
        dest[index] = value;
    }

    std::chrono::duration<double> mappingTime = std::chrono::system_clock::now() - startTime;
    std::cout << "Mapping Duration " << mappingTime.count() << " seconds" << std::endl;
    inputStream.reset();
}
I'm currently running a test with 200M records, which on my machine takes a significant amount of time (2000+ seconds without code-changes). It is very clear that the time taken is from disk-I/O, and I'm seeing IO-rates of 50-70MB/s, which is pretty good, as I don't really expect my rather unsophisticated setup to deliver much more than that. The improvement is not as good with the larger size, but still a decent improvement: 1502s total time, vs 2021s for the "read and write in the same loop".
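As a rough sanity check on those numbers, here is the arithmetic as a small sketch. It assumes the mapping holds one uint64_t slot per record (which matches the size=1525 printed by the benchmark, if that figure is in MiB) and treats MB/s as 10^6 bytes per second; the function names are made up for illustration.

```cpp
#include <cstdint>

// Size in bytes of a destination array with one uint64_t slot per record.
constexpr std::uint64_t MappingBytes(std::uint64_t rowCount)
{
    return rowCount * sizeof(std::uint64_t);
}

// Whole mebibytes (2^20 bytes) in a byte count, rounded down.
constexpr std::uint64_t WholeMiB(std::uint64_t bytes)
{
    return bytes / (1024ull * 1024ull);
}

// Seconds needed to transfer the given number of bytes exactly once at a
// steady rate, with the rate given in MB/s (assumed here to be 10^6 bytes/s).
constexpr double SequentialSeconds(std::uint64_t bytes, double megabytesPerSecond)
{
    return static_cast<double>(bytes) / (megabytesPerSecond * 1e6);
}
```

For 200M records, MappingBytes gives 1,600,000,000 bytes, which is 1525 whole MiB. Writing that once from start to finish at 50 to 70 MB/s would take roughly 23 to 32 seconds, so run times in the thousands of seconds suggest that far more data than the file size is moving to and from the disk.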
Also, I'd like to point out that this is a rather terrible test for any system - the fact that Linux is notably worse than Windows is beside the point - you do NOT really want to map a large file and write 8 bytes [meaning the 4KB page has to be read in] to each page at random. If this reflects your REAL application, then you seriously should rethink your approach in some way. It will run fine when you have enough free memory that the whole memory-mapped region fits in RAM.
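One possible way to rethink it, when the records fit in memory as they do in the split-loop version, is to sort them by destination index before writing, so that the mapped pages are touched in ascending order instead of at random. This is only a sketch of that idea: I have not benchmarked it, and Record and ScatterSorted are hypothetical names, not part of the original program.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical record type: destination slot and the value to store there.
struct Record
{
    std::uint32_t index;
    std::uint32_t value;
};

// Writes each record's value to dest[index], visiting destinations in
// ascending index order. The vector is taken by value so that the caller's
// copy keeps its original order. dest must have room for the largest index.
void ScatterSorted(std::vector<Record> records, std::uint64_t* dest)
{
    // stable_sort keeps records with equal indices in input order, so the
    // last record for an index still wins, as in the unsorted loop.
    std::stable_sort(records.begin(), records.end(),
                     [](const Record& a, const Record& b) { return a.index < b.index; });

    for (const Record& r : records)
        dest[r.index] = r.value;
}
```

The sort itself costs time and memory, so whether this is a net win depends on the data set. For inputs larger than memory the same idea would have to be applied chunk by chunk, which only makes the access pattern sequential within each chunk.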
There is plenty of RAM in my system, so I believe that the problem is that Linux doesn't like too many mapped pages that are "dirty".
I have a feeling that this may have something to do with it: https://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages More explanation: http://www.westnet.com/~gsmith/content/linux-pdflush.htm
Unfortunately, I'm also very tired, and need to sleep. I'll see if I can experiment with these tomorrow, but don't hold your breath. Like I said, this is not REALLY an answer, but rather a long comment that doesn't really fit in a comment (and contains code, which is completely rubbish to read in a comment).