In Python, how do I read in a binary file and loop over each byte of that file?
To read a file one byte at a time (ignoring buffering), you could use the two-argument iter(callable, sentinel) built-in function:
with open(filename, 'rb') as file:
    for byte in iter(lambda: file.read(1), b''):
        # Do stuff with byte (a bytes object of length 1)
        pass
It calls file.read(1) until it returns b'' (an empty bytestring). Memory use does not grow without bound for large files. You could pass buffering=0 to open() to disable buffering; this guarantees that only one byte is read per iteration (slow).
The with-statement closes the file automatically, including the case when the code underneath raises an exception.
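As a small illustration of how the sentinel form of iter() stops, here is a sketch that uses an in-memory io.BytesIO object in place of a real file. The sample data is made up for the example.

```python
import io

# In-memory stand-in for a file opened in 'rb' mode (made-up sample data).
file = io.BytesIO(b'\x00\x01\xff')

# iter(callable, sentinel) keeps calling file.read(1) until it returns b''.
chunks = list(iter(lambda: file.read(1), b''))

# Each item is a bytes object of length 1, not an int.
print(chunks)
```

Note that each item is a one-byte bytes object; use byte[0] or ord(byte) if you need the integer value.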
Despite the presence of internal buffering by default, it is still inefficient to process one byte at a time. For example, here's the blackhole.py utility that eats everything it is given:
#!/usr/bin/env python3
"""Discard all input. `cat > /dev/null` analog."""
import sys
from functools import partial
from collections import deque
chunksize = int(sys.argv[1]) if len(sys.argv) > 1 else (1 << 15)
deque(iter(partial(sys.stdin.detach().read, chunksize), b''), maxlen=0)
Example:
$ dd if=/dev/zero bs=1M count=1000 | python3 blackhole.py
It processes ~1.5 GB/s when chunksize == 32768 on my machine and only ~7.5 MB/s when chunksize == 1. That is, it is 200 times slower to read one byte at a time. Take this into account if you need performance and can rewrite your processing to handle more than one byte at a time.
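If the processing still has to look at every byte, one possible compromise is to read in chunks and iterate inside each chunk. The following is only a sketch: iter_bytes is a hypothetical helper name, and the temporary file exists just to make the example self-contained.

```python
import os
import tempfile


def iter_bytes(path, chunksize=1 << 15):
    """Yield each byte of the file as an int, reading chunksize bytes at a time."""
    with open(path, 'rb') as file:
        for chunk in iter(lambda: file.read(chunksize), b''):
            # Iterating over a bytes object yields ints in Python 3.
            yield from chunk


# Usage on a throwaway temporary file (made-up data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(10)))
try:
    total = sum(iter_bytes(tmp.name, chunksize=4))
finally:
    os.remove(tmp.name)
print(total)
```

Unlike file.read(1), this yields integers rather than one-byte bytes objects, so the code that consumes the bytes may need a small adjustment.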
mmap allows you to treat a file as a bytearray and a file object simultaneously. It can serve as an alternative to loading the whole file into memory if you need access to both interfaces. In particular, you can iterate one byte at a time over a memory-mapped file using a plain for-loop:
from mmap import ACCESS_READ, mmap

with open(filename, 'rb', 0) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as s:
    for byte in s:  # length is equal to the current file size
        # Do stuff with byte
        pass
mmap supports the slice notation. For example, mm[i:i+len] returns len bytes from the file starting at position i. The context manager protocol is not supported before Python 3.2; you need to call mm.close() explicitly in this case. Iterating over each byte using mmap consumes more memory than file.read(1), but mmap is an order of magnitude faster.
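To make the slicing and the explicit close() concrete, here is a minimal sketch. The temporary file and its contents are made up for the example, and the variable n is used instead of len to avoid shadowing the built-in.

```python
import mmap
import os
import tempfile

# Create a throwaway file with made-up contents.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'hello, binary world')

try:
    with open(tmp.name, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            i, n = 7, 6
            piece = mm[i:i + n]  # n bytes starting at position i
            size = len(mm)       # equals the current file size
        finally:
            mm.close()           # explicit close, as needed before Python 3.2
finally:
    os.remove(tmp.name)

print(piece, size)
```

The try/finally around the mapping plays the role that the with-statement plays on newer Python versions.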