In Python, how do I read in a binary file and loop over each byte of that file?
This post itself is not a direct answer to the question. What it is instead is a data-driven extensible benchmark that can be used to compare many of the answers (and variations of utilizing new features added in later, more modern, versions of Python) that have been posted to this question — and should therefore be helpful in determining which has the best performance.
In a few cases I've modified the code in the referenced answer to make it compatible with the benchmark framework.
First, here are the results for what currently are the latest versions of Python 2 & 3:
Fastest to slowest execution speeds with 32-bit Python 2.7.16
numpy version 1.16.5
Test file size: 1,024 KiB
100 executions, best of 3 repetitions
1 Tcll (array.array) : 3.8943 secs, rel speed 1.00x, 0.00% slower (262.95 KiB/sec)
2 Vinay Sajip (read all into memory) : 4.1164 secs, rel speed 1.06x, 5.71% slower (248.76 KiB/sec)
3 codeape + iter + partial : 4.1616 secs, rel speed 1.07x, 6.87% slower (246.06 KiB/sec)
4 codeape : 4.1889 secs, rel speed 1.08x, 7.57% slower (244.46 KiB/sec)
5 Vinay Sajip (chunked) : 4.1977 secs, rel speed 1.08x, 7.79% slower (243.94 KiB/sec)
6 Aaron Hall (Py 2 version) : 4.2417 secs, rel speed 1.09x, 8.92% slower (241.41 KiB/sec)
7 gerrit (struct) : 4.2561 secs, rel speed 1.09x, 9.29% slower (240.59 KiB/sec)
8 Rick M. (numpy) : 8.1398 secs, rel speed 2.09x, 109.02% slower (125.80 KiB/sec)
9 Skurmedel : 31.3264 secs, rel speed 8.04x, 704.42% slower ( 32.69 KiB/sec)
Benchmark runtime (min:sec) - 03:26
Fastest to slowest execution speeds with 32-bit Python 3.8.0
numpy version 1.17.4
Test file size: 1,024 KiB
100 executions, best of 3 repetitions
1 Vinay Sajip + "yield from" + "walrus operator" : 3.5235 secs, rel speed 1.00x, 0.00% slower (290.62 KiB/sec)
2 Aaron Hall + "yield from" : 3.5284 secs, rel speed 1.00x, 0.14% slower (290.22 KiB/sec)
3 codeape + iter + partial + "yield from" : 3.5303 secs, rel speed 1.00x, 0.19% slower (290.06 KiB/sec)
4 Vinay Sajip + "yield from" : 3.5312 secs, rel speed 1.00x, 0.22% slower (289.99 KiB/sec)
5 codeape + "yield from" + "walrus operator" : 3.5370 secs, rel speed 1.00x, 0.38% slower (289.51 KiB/sec)
6 codeape + "yield from" : 3.5390 secs, rel speed 1.00x, 0.44% slower (289.35 KiB/sec)
7 jfs (mmap) : 4.0612 secs, rel speed 1.15x, 15.26% slower (252.14 KiB/sec)
8 Vinay Sajip (read all into memory) : 4.5948 secs, rel speed 1.30x, 30.40% slower (222.86 KiB/sec)
9 codeape + iter + partial : 4.5994 secs, rel speed 1.31x, 30.54% slower (222.64 KiB/sec)
10 codeape : 4.5995 secs, rel speed 1.31x, 30.54% slower (222.63 KiB/sec)
11 Vinay Sajip (chunked) : 4.6110 secs, rel speed 1.31x, 30.87% slower (222.08 KiB/sec)
12 Aaron Hall (Py 2 version) : 4.6292 secs, rel speed 1.31x, 31.38% slower (221.20 KiB/sec)
13 Tcll (array.array) : 4.8627 secs, rel speed 1.38x, 38.01% slower (210.58 KiB/sec)
14 gerrit (struct) : 5.0816 secs, rel speed 1.44x, 44.22% slower (201.51 KiB/sec)
15 Rick M. (numpy) + "yield from" : 11.8084 secs, rel speed 3.35x, 235.13% slower ( 86.72 KiB/sec)
16 Skurmedel : 11.8806 secs, rel speed 3.37x, 237.18% slower ( 86.19 KiB/sec)
17 Rick M. (numpy) : 13.3860 secs, rel speed 3.80x, 279.91% slower ( 76.50 KiB/sec)
Benchmark runtime (min:sec) - 04:47
I also ran it with a much larger 10 MiB test file (which took nearly an hour to run) and got performance results which were comparable to those shown above.
Here's the code used to do the benchmarking:
from __future__ import print_function
import array
import atexit
from collections import deque, namedtuple
import io
from mmap import ACCESS_READ, mmap
import numpy as np
from operator import attrgetter
import os
import random
import struct
import sys
import tempfile
from textwrap import dedent
import time
import timeit
import traceback
try:
xrange
except NameError: # Python 3
xrange = range
class KiB(int):
""" KibiBytes - multiples of the byte units for quantities of information. """
def __new__(self, value=0):
return 1024*value
BIG_TEST_FILE = 1 # MiBs or 0 for a small file.
SML_TEST_FILE = KiB(64)
EXECUTIONS = 100 # Number of times each "algorithm" is executed per timing run.
TIMINGS = 3 # Number of timing runs.
CHUNK_SIZE = KiB(8)
if BIG_TEST_FILE:
FILE_SIZE = KiB(1024) * BIG_TEST_FILE
else:
FILE_SIZE = SML_TEST_FILE # For quicker testing.
# Common setup for all algorithms -- prefixed to each algorithm's setup.
COMMON_SETUP = dedent("""
# Make accessible in algorithms.
from __main__ import array, deque, get_buffer_size, mmap, np, struct
from __main__ import ACCESS_READ, CHUNK_SIZE, FILE_SIZE, TEMP_FILENAME
from functools import partial
try:
xrange
except NameError: # Python 3
xrange = range
""")
def get_buffer_size(path):
""" Determine optimal buffer size for reading files. """
st = os.stat(path)
try:
bufsize = st.st_blksize # Available on some Unix systems (like Linux)
except AttributeError:
bufsize = io.DEFAULT_BUFFER_SIZE
return bufsize
# Utility primarily for use when embedding additional algorithms into benchmark.
VERIFY_NUM_READ = """
# Verify generator reads correct number of bytes (assumes values are correct).
bytes_read = sum(1 for _ in file_byte_iterator(TEMP_FILENAME))
assert bytes_read == FILE_SIZE, \
'Wrong number of bytes generated: got {:,} instead of {:,}'.format(
bytes_read, FILE_SIZE)
"""
TIMING = namedtuple('TIMING', 'label, exec_time')
class Algorithm(namedtuple('CodeFragments', 'setup, test')):
# Default timeit "stmt" code fragment.
_TEST = """
#for b in file_byte_iterator(TEMP_FILENAME): # Loop over every byte.
# pass # Do stuff with byte...
deque(file_byte_iterator(TEMP_FILENAME), maxlen=0) # Data sink.
"""
# Must overload __new__ because (named)tuples are immutable.
def __new__(cls, setup, test=None):
""" Dedent (unindent) code fragment string arguments.
Args:
`setup` -- Code fragment that defines things used by `test` code.
In this case it should define a generator function named
`file_byte_iterator()` that will be passed that name of a test file
of binary data. This code is not timed.
`test` -- Code fragment that uses things defined in `setup` code.
Defaults to _TEST. This is the code that's timed.
"""
test = cls._TEST if test is None else test # Use default unless one is provided.
# Uncomment to replace all performance tests with one that verifies the correct
# number of bytes values are being generated by the file_byte_iterator function.
#test = VERIFY_NUM_READ
return tuple.__new__(cls, (dedent(setup), dedent(test)))
algorithms = {
'Aaron Hall (Py 2 version)': Algorithm("""
def file_byte_iterator(path):
with open(path, "rb") as file:
callable = partial(file.read, 1024)
sentinel = bytes() # or b''
for chunk in iter(callable, sentinel):
for byte in chunk:
yield byte
"""),
"codeape": Algorithm("""
def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
with open(filename, "rb") as f:
while True:
chunk = f.read(chunksize)
if chunk:
for b in chunk:
yield b
else:
break
"""),
"codeape + iter + partial": Algorithm("""
def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
with open(filename, "rb") as f:
for chunk in iter(partial(f.read, chunksize), b''):
for b in chunk:
yield b
"""),
"gerrit (struct)": Algorithm("""
def file_byte_iterator(filename):
with open(filename, "rb") as f:
fmt = '{}B'.format(FILE_SIZE) # Reads entire file at once.
for b in struct.unpack(fmt, f.read()):
yield b
"""),
'Rick M. (numpy)': Algorithm("""
def file_byte_iterator(filename):
for byte in np.fromfile(filename, 'u1'):
yield byte
"""),
"Skurmedel": Algorithm("""
def file_byte_iterator(filename):
with open(filename, "rb") as f:
byte = f.read(1)
while byte:
yield byte
byte = f.read(1)
"""),
"Tcll (array.array)": Algorithm("""
def file_byte_iterator(filename):
with open(filename, "rb") as f:
arr = array.array('B')
arr.fromfile(f, FILE_SIZE) # Reads entire file at once.
for b in arr:
yield b
"""),
"Vinay Sajip (read all into memory)": Algorithm("""
def file_byte_iterator(filename):
with open(filename, "rb") as f:
bytes_read = f.read() # Reads entire file at once.
for b in bytes_read:
yield b
"""),
"Vinay Sajip (chunked)": Algorithm("""
def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
with open(filename, "rb") as f:
chunk = f.read(chunksize)
while chunk:
for b in chunk:
yield b
chunk = f.read(chunksize)
"""),
} # End algorithms
#
# Versions of algorithms that will only work in certain releases (or better) of Python.
#
if sys.version_info >= (3, 3):
algorithms.update({
'codeape + iter + partial + "yield from"': Algorithm("""
def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
with open(filename, "rb") as f:
for chunk in iter(partial(f.read, chunksize), b''):
yield from chunk
"""),
'codeape + "yield from"': Algorithm("""
def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
with open(filename, "rb") as f:
while True:
chunk = f.read(chunksize)
if chunk:
yield from chunk
else:
break
"""),
"jfs (mmap)": Algorithm("""
def file_byte_iterator(filename):
with open(filename, "rb") as f, \
mmap(f.fileno(), 0, access=ACCESS_READ) as s:
yield from s
"""),
'Rick M. (numpy) + "yield from"': Algorithm("""
def file_byte_iterator(filename):
# data = np.fromfile(filename, 'u1')
yield from np.fromfile(filename, 'u1')
"""),
'Vinay Sajip + "yield from"': Algorithm("""
def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
with open(filename, "rb") as f:
chunk = f.read(chunksize)
while chunk:
yield from chunk # Added in Py 3.3
chunk = f.read(chunksize)
"""),
}) # End Python 3.3 update.
if sys.version_info >= (3, 5):
algorithms.update({
'Aaron Hall + "yield from"': Algorithm("""
from pathlib import Path
def file_byte_iterator(path):
''' Given a path, return an iterator over the file
that lazily loads the file.
'''
path = Path(path)
bufsize = get_buffer_size(path)
with path.open('rb') as file:
reader = partial(file.read1, bufsize)
for chunk in iter(reader, bytes()):
yield from chunk
"""),
}) # End Python 3.5 update.
if sys.version_info >= (3, 8, 0):
algorithms.update({
'Vinay Sajip + "yield from" + "walrus operator"': Algorithm("""
def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
with open(filename, "rb") as f:
while chunk := f.read(chunksize):
yield from chunk # Added in Py 3.3
"""),
'codeape + "yield from" + "walrus operator"': Algorithm("""
def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
with open(filename, "rb") as f:
while chunk := f.read(chunksize):
yield from chunk
"""),
}) # End Python 3.8.0 update.update.
#### Main ####
def main():
global TEMP_FILENAME
def cleanup():
""" Clean up after testing is completed. """
try:
os.remove(TEMP_FILENAME) # Delete the temporary file.
except Exception:
pass
atexit.register(cleanup)
# Create a named temporary binary file of pseudo-random bytes for testing.
fd, TEMP_FILENAME = tempfile.mkstemp('.bin')
with os.fdopen(fd, 'wb') as file:
os.write(fd, bytearray(random.randrange(256) for _ in range(FILE_SIZE)))
# Execute and time each algorithm, gather results.
start_time = time.time() # To determine how long testing itself takes.
timings = []
for label in algorithms:
try:
timing = TIMING(label,
min(timeit.repeat(algorithms[label].test,
setup=COMMON_SETUP + algorithms[label].setup,
repeat=TIMINGS, number=EXECUTIONS)))
except Exception as exc:
print('{} occurred timing the algorithm: "{}"\n {}'.format(
type(exc).__name__, label, exc))
traceback.print_exc(file=sys.stdout) # Redirect to stdout.
sys.exit(1)
timings.append(timing)
# Report results.
print('Fastest to slowest execution speeds with {}-bit Python {}.{}.{}'.format(
64 if sys.maxsize > 2**32 else 32, *sys.version_info[:3]))
print(' numpy version {}'.format(np.version.full_version))
print(' Test file size: {:,} KiB'.format(FILE_SIZE // KiB(1)))
print(' {:,d} executions, best of {:d} repetitions'.format(EXECUTIONS, TIMINGS))
print()
longest = max(len(timing.label) for timing in timings) # Len of longest identifier.
ranked = sorted(timings, key=attrgetter('exec_time')) # Sort so fastest is first.
fastest = ranked[0].exec_time
for rank, timing in enumerate(ranked, 1):
print('{:<2d} {:>{width}} : {:8.4f} secs, rel speed {:6.2f}x, {:6.2f}% slower '
'({:6.2f} KiB/sec)'.format(
rank,
timing.label, timing.exec_time, round(timing.exec_time/fastest, 2),
round((timing.exec_time/fastest - 1) * 100, 2),
(FILE_SIZE/timing.exec_time) / KiB(1), # per sec.
width=longest))
print()
mins, secs = divmod(time.time()-start_time, 60)
print('Benchmark runtime (min:sec) - {:02d}:{:02d}'.format(int(mins),
int(round(secs))))
main()
if you are looking for something speedy, here's a method I've been using that's worked for years:
from array import array
with open( path, 'rb' ) as file:
data = array( 'B', file.read() ) # buffer the file
# evaluate it's data
for byte in data:
v = byte # int value
c = chr(byte)
if you want to iterate chars instead of ints, you can simply use data = file.read()
, which should be a bytes() object in py3.
Here's an example of reading Network endian data using Numpy fromfile addressing @Nirmal comments above:
dtheader= np.dtype([('Start Name','b', (4,)),
('Message Type', np.int32, (1,)),
('Instance', np.int32, (1,)),
('NumItems', np.int32, (1,)),
('Length', np.int32, (1,)),
('ComplexArray', np.int32, (1,))])
dtheader=dtheader.newbyteorder('>')
headerinfo = np.fromfile(iqfile, dtype=dtheader, count=1)
print(raw['Start Name'])
I hope this helps. The problem is that fromfile doesn't recognize and EOF and allow gracefully breaking out of the loop for files of arbitrary size.
Python 2.4 and Earlier
f = open("myfile", "rb")
try:
byte = f.read(1)
while byte != "":
# Do stuff with byte.
byte = f.read(1)
finally:
f.close()
Python 2.5-2.7
with open("myfile", "rb") as f:
byte = f.read(1)
while byte != "":
# Do stuff with byte.
byte = f.read(1)
Note that the with statement is not available in versions of Python below 2.5. To use it in v 2.5 you'll need to import it:
from __future__ import with_statement
In 2.6 this is not needed.
Python 3
In Python 3, it's a bit different. We will no longer get raw characters from the stream in byte mode but byte objects, thus we need to alter the condition:
with open("myfile", "rb") as f:
byte = f.read(1)
while byte != b"":
# Do stuff with byte.
byte = f.read(1)
Or as benhoyt says, skip the not equal and take advantage of the fact that b""
evaluates to false. This makes the code compatible between 2.6 and 3.x without any changes. It would also save you from changing the condition if you go from byte mode to text or the reverse.
with open("myfile", "rb") as f:
byte = f.read(1)
while byte:
# Do stuff with byte.
byte = f.read(1)
python 3.8
From now on thanks to := operator the above code can be written in a shorter way.
with open("myfile", "rb") as f:
while (byte := f.read(1)):
# Do stuff with byte.
After trying all the above and using the answer from @Aaron Hall, I was getting memory errors for a ~90 Mb file on a computer running Window 10, 8 Gb RAM and Python 3.5 32-bit. I was recommended by a colleague to use numpy
instead and it works wonders.
By far, the fastest to read an entire binary file (that I have tested) is:
import numpy as np
file = "binary_file.bin"
data = np.fromfile(file, 'u1')
Reference
Multitudes faster than any other methods so far. Hope it helps someone!
Python 3, read all of the file at once:
with open("filename", "rb") as binary_file:
# Read the whole file at once
data = binary_file.read()
print(data)
You can iterate whatever you want using data
variable.