I have some huge files I need to parse, and people have been recommending mmap because it should avoid having to load the entire file into memory. But looking at top, I can't tell which of its memory columns reflects what my process is actually using.
top has many memory-related columns. Most of them are based on the size of the memory space mapped into the process, including any shared libraries, swapped-out RAM, and mmapped space.
Check the RES column; it shows the physical RAM currently in use. I think (but am not sure) that it includes the RAM used to cache the mmapped file.
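If you want to check this from inside the program rather than eyeballing top, here's a minimal sketch (assuming Linux, where /proc/self/status exposes the resident set size as VmRSS, the same figure top reports as RES):

```c
/* Print this process's resident set size (top's RES column) by
 * reading the VmRSS field from /proc/self/status. Linux-specific. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    char line[256];
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            fputs(line, stdout);   /* e.g. "VmRSS:     1234 kB" */
            break;
        }
    }
    fclose(f);
    return 0;
}
```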
You may have been offered the wrong advice.
Memory-mapped files (mmap) will consume more and more physical memory as you parse through them. When physical memory runs low, the kernel unmaps sections of the file based on its LRU (least recently used) algorithm. But the LRU is global: it may also force other processes to swap pages to disk and shrink the disk cache, which can have a severely negative effect on the performance of other processes and on the system as a whole.
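If you do end up scanning an mmapped file front to back, you can at least soften that pressure by telling the kernel what you're doing. A rough sketch, assuming Linux madvise() on a read-only mapping (scan_chunks and its parameters are my own illustration, not from the question):

```c
/* Hint the kernel while scanning a read-only mmapped region.
 * scan_chunks() is illustrative; base must be the address returned
 * by mmap(), and chunk must be a multiple of the page size so each
 * madvise() call stays page-aligned. */
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

void scan_chunks(char *base, size_t len, size_t chunk)
{
    /* We'll read front to back: lets the kernel read ahead eagerly. */
    madvise(base, len, MADV_SEQUENTIAL);

    for (size_t off = 0; off < len; off += chunk) {
        size_t n = (len - off < chunk) ? len - off : chunk;

        /* ... parse base[off] .. base[off + n - 1] here ... */

        /* Done with these pages: drop them now instead of waiting
         * for global LRU pressure to evict them (or someone else's). */
        madvise(base + off, n, MADV_DONTNEED);
    }
}
```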
If you are reading through files linearly, like counting the number of lines, mmap is a bad choice: it will fill physical memory before releasing any of it back to the system. It is better to use traditional I/O methods that stream or read one block at a time, so that memory can be released as soon as each block has been processed.
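For example, a block-at-a-time line counter only ever holds one small buffer, no matter how big the file is. A minimal sketch with plain read():

```c
/* Count newlines by reading fixed-size blocks; memory use stays
 * constant at one 64 KiB buffer regardless of file size. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[1 << 16];
    long lines = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* Scan the block for newlines. */
        for (char *p = buf; (p = memchr(p, '\n', buf + n - p)) != NULL; p++)
            lines++;
    }
    close(fd);
    printf("%ld lines\n", lines);
    return 0;
}
```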
If you are randomly accessing a file, mmap is an okay choice. It's not optimal, since you are still relying on the kernel's general-purpose LRU algorithm, but it's faster than writing your own caching mechanism.
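A minimal sketch of that pattern, assuming POSIX mmap plus a Linux madvise(MADV_RANDOM) hint to discourage pointless read-ahead (the single-byte probe is just for illustration):

```c
/* Randomly access a large file through mmap, leaving caching and
 * eviction to the kernel's page cache. */
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s FILE OFFSET\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);                 /* the mapping keeps the file referenced */

    /* Accesses will be scattered, so read-ahead is mostly wasted. */
    madvise(base, st.st_size, MADV_RANDOM);

    long off = atol(argv[2]);  /* probe a single byte, for illustration */
    if (off >= 0 && off < st.st_size)
        printf("byte at %ld: 0x%02x\n", off,
               (unsigned)(unsigned char)base[off]);

    munmap(base, st.st_size);
    return 0;
}
```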
In general, I would never recommend mmap, except for a few extreme performance edge cases, like accessing the file from multiple processes or threads at the same time, or when the file is small relative to the amount of free memory.