I need to process some data that is a few hundred times bigger than RAM. I would like to read in a large chunk, process it, save the result, free the memory and repeat. Is
The general key is that you want to process the file iteratively.
If you're just dealing with a text file, this is trivial: a for line in f: loop only reads in one line at a time. (Actually it buffers things up, but the buffers are small enough that you don't have to worry about it.)
If you're dealing with some other specific file type, like a numpy binary file, a CSV file, an XML document, etc., there are generally similar special-purpose solutions, but nobody can describe them to you unless you tell us what kind of data you have.
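As one illustration of such a special-purpose solution, here is a minimal sketch for the CSV case. The function name sum_column, the header row, and the numeric column are all assumptions made up for the example; the only point is that csv.reader hands you one row at a time, so memory use stays flat no matter how big the file is.

```python
import csv

def sum_column(path, column):
    """Sum one numeric column of a CSV file, one row at a time.

    Hypothetical helper: assumes the file has a header row and that
    the chosen column holds numbers.
    """
    total = 0.0
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the (assumed) header row
        for row in reader:  # only the current row is held in memory
            total += float(row[column])
    return total
```

Something like sum_column('big.csv', 2) would then walk the whole file without ever loading more than a row.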
But what if you have a general binary file?
First, the read method takes an optional maximum number of bytes to read. So, instead of this:
data = f.read()
process(data)
You can do this:
while True:
    data = f.read(8192)
    if not data:
        break
    process(data)
You may want to instead write a function like this:
def chunks(f):
    while True:
        data = f.read(8192)
        if not data:
            break
        yield data
Then you can just do this:
for chunk in chunks(f):
    process(chunk)
You could also do this with the two-argument iter, but many people find that a bit obscure:
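from functools import partial

# iter(callable, sentinel) keeps calling f.read(8192) until it returns b''
for chunk in iter(partial(f.read, 8192), b''):
    process(chunk)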
Either way, this option applies to all of the other variants below (except for a single mmap, which is trivial enough that there's no point).
There's nothing magic about the number 8192 there. You generally do want a power of 2, and ideally a multiple of your system's page size. Beyond that, your performance won't vary that much whether you're using 4KB or 4MB, and if it does, you'll have to test what works best for your use case.
Anyway, this assumes you can just process each 8K at a time without keeping around any context. If you're, e.g., feeding data into a progressive decoder or hasher or something, that's perfect.
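As a sketch of the "progressive hasher" case, something like the following should work. The function name sha256_of_file and the chunk_size parameter are just illustrative; the hashlib object keeps its own running state, so each chunk can be thrown away as soon as it has been fed in.

```python
import hashlib

def sha256_of_file(f, chunk_size=8192):
    """Return the SHA-256 hex digest of a binary file object,
    reading at most chunk_size bytes at a time."""
    h = hashlib.sha256()
    while True:
        data = f.read(chunk_size)
        if not data:
            break
        h.update(data)  # no context needs to be kept outside the hasher
    return h.hexdigest()
```

You would call it with a file opened in binary mode, e.g. with open(path, 'rb') as f: digest = sha256_of_file(f).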
But if you need to process one "chunk" at a time, your chunks could end up straddling an 8K boundary. How do you deal with that?
It depends on how your chunks are delimited in the file, but the basic idea is pretty simple. For example, let's say you use NUL bytes as a separator (not very likely, but easy to show as a toy example).
data = b''
while True:
    buf = f.read(8192)
    if not buf:
        process(data)
        break
    data += buf
    chunks = data.split(b'\0')
    for chunk in chunks[:-1]:
        process(chunk)
    data = chunks[-1]
This kind of code is very common in networking (because sockets can't just "read all", so you always have to read into a buffer and chunk into messages), so you may find some useful examples in networking code that uses a protocol similar to your file format.
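If you'd rather keep the buffering separate from the processing, the same idea can be wrapped in a generator. This is only a sketch: the name records and its sep and bufsize parameters are made up for illustration, and unlike a plain split it deliberately skips an empty trailing piece when the file ends with a separator.

```python
def records(f, sep=b'\0', bufsize=8192):
    """Yield sep-delimited records from a binary file object,
    reading at most bufsize bytes at a time."""
    data = b''
    while True:
        buf = f.read(bufsize)
        if not buf:
            if data:
                yield data  # final record with no trailing separator
            return
        data += buf
        # Everything except the last piece is a complete record; the last
        # piece may be partial, so carry it over to the next read.
        *complete, data = data.split(sep)
        yield from complete
```

Then the calling code is just for record in records(f): process(record), the same shape as the chunks(f) loop.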
Alternatively, you can use mmap.
If your virtual memory size is larger than the file, this is trivial:
import mmap

# A length of 0 maps the whole file
with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    process(m)
Now m acts like a giant bytes object, just as if you'd called read() to read the whole thing into memory, but the OS will automatically page bits in and out of memory as necessary.
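To make "acts like a giant bytes object" concrete, here is a small sketch that searches a mapping with find, exactly as you would search a bytes object. The helper name count_separators is hypothetical. Note that mmap refuses to map an empty file, so a real program would need to check for that first.

```python
import mmap

def count_separators(path, sep=b'\0'):
    """Count non-overlapping occurrences of sep in a file, searching the
    memory map directly instead of reading the file into a bytes object."""
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; raises ValueError on an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            count = 0
            pos = m.find(sep)
            while pos != -1:
                count += 1
                pos = m.find(sep, pos + len(sep))
            return count
```

Slicing works the same way: m[start:end] gives you an ordinary bytes copy of just that range.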
If you're trying to read a file too big to fit into your virtual memory size (e.g., a 4GB file with 32-bit Python, or a 20EB file with 64-bit Python, which is only likely to happen in 2013 if you're reading a sparse or virtual file like, say, the VM file for another process on Linux), you have to implement windowing: mmap in a piece of the file at a time. For example:
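import mmap
import os

windowsize = 8*1024*1024  # must be a multiple of mmap.ALLOCATIONGRANULARITY
size = os.fstat(f.fileno()).st_size
for start in range(0, size, windowsize):
    # The last window is usually shorter; mapping past end of file fails
    length = min(windowsize, size - start)
    with mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ,
                   offset=start) as m:
        process(m)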
Of course mapping windows has the same issue as reading chunks if you need to delimit things, and you can solve it the same way.
But, as an optimization, instead of buffering, you can slide the window forward only as far as the page containing the end of the last complete message, rather than a fixed 8MB at a time, which lets you avoid copying leftover data between windows. This is a bit more complicated, so if you want to do it, search for something like "sliding mmap window", and write a new question if you get stuck.
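For what it's worth, here is one rough sketch of what such a sliding window could look like, using the toy NUL separator from earlier. Everything about it (the name mmap_records, its parameters, and the policy of doubling the window when a record doesn't fit) is an assumption for illustration, not a standard recipe. Each yielded record is still copied out as a bytes object; what it avoids is carrying a leftover buffer from one window to the next.

```python
import mmap
import os

def mmap_records(f, sep=b'\0', windowsize=8 * 1024 * 1024):
    """Yield sep-delimited records from file object f via a sliding mmap window."""
    gran = mmap.ALLOCATIONGRANULARITY  # map offsets must be multiples of this
    size = os.fstat(f.fileno()).st_size
    pos = 0              # absolute offset of the first unconsumed byte
    window = windowsize  # current window length (may grow temporarily)
    while pos < size:
        offset = pos - pos % gran  # start of the page containing pos
        length = min(window, size - offset)
        with mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ,
                       offset=offset) as m:
            start = pos - offset
            end = m.find(sep, start)
            while end != -1:
                yield m[start:end]  # one complete record
                start = end + len(sep)
                end = m.find(sep, start)
            at_eof = offset + length == size
            if at_eof and start < length:
                yield m[start:]  # final record with no trailing separator
        if at_eof:
            return
        if offset + start == pos:
            window *= 2          # no complete record fit: widen and retry
        else:
            window = windowsize  # made progress: back to the normal size
        pos = offset + start
```

The file has to be a real file opened in binary mode (mmap needs a file descriptor), and usage is the familiar for record in mmap_records(f): process(record).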