Reading binary file and looping over each byte

孤街浪徒 2020-11-22 00:53

In Python, how do I read in a binary file and loop over each byte of that file?

12 Answers
  • 2020-11-22 01:13

    To read a file — one byte at a time (ignoring the buffering) — you could use the two-argument iter(callable, sentinel) built-in function:

    with open(filename, 'rb') as file:
        for byte in iter(lambda: file.read(1), b''):
            ...  # do stuff with byte (a length-1 bytes object)
    

    It calls file.read(1) until it returns the empty bytestring b''. Memory doesn't grow without bound for large files. You can pass buffering=0 to open() to disable buffering entirely; that guarantees only one byte is read from the OS per iteration (slow). A minimal sketch of that variant follows below.

    The with statement closes the file automatically, including the case where the code inside it raises an exception.
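
    A minimal sketch of the buffering=0 variant, assuming filename names an existing file; every iteration then issues a real single-byte read against the operating system:

    with open(filename, 'rb', buffering=0) as file:  # raw, unbuffered stream
        for byte in iter(lambda: file.read(1), b''):
            ...  # byte is a length-1 bytes object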

    Despite the presence of internal buffering by default, it is still inefficient to process one byte at a time. For example, here's the blackhole.py utility that eats everything it is given:

    #!/usr/bin/env python3
    """Discard all input. `cat > /dev/null` analog."""
    import sys
    from functools import partial
    from collections import deque
    
    chunksize = int(sys.argv[1]) if len(sys.argv) > 1 else (1 << 15)
    deque(iter(partial(sys.stdin.detach().read, chunksize), b''), maxlen=0)
    

    Example:

    $ dd if=/dev/zero bs=1M count=1000 | python3 blackhole.py
    

    It processes ~1.5 GB/s when chunksize == 32768 on my machine and only ~7.5 MB/s when chunksize == 1; in other words, reading one byte at a time is about 200 times slower. Keep that in mind if you need performance and can restructure your processing to work on more than one byte at a time.
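
    A hedged sketch of that chunked rewrite, assuming a hypothetical per-byte handler handle() and the 32768-byte chunk size used above:

    from functools import partial

    with open(filename, 'rb') as file:
        for chunk in iter(partial(file.read, 1 << 15), b''):  # 32768-byte chunks
            for byte in chunk:        # each byte is an int 0..255 in Python 3
                handle(byte)          # hypothetical per-byte processing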

    mmap allows you to treat a file as a bytearray and a file object simultaneously. It can serve as an alternative to loading the whole file into memory if you need access to both interfaces. In particular, you can iterate one byte at a time over a memory-mapped file using a plain for-loop:

    from mmap import ACCESS_READ, mmap
    
    with open(filename, 'rb', 0) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as s:
        for byte in s: # length is equal to the current file size
            ...  # do stuff with byte
    

    mmap supports slice notation. For example, s[i:i+n] returns n bytes from the file starting at position i. The context manager protocol is not supported before Python 3.2; in that case you need to call s.close() explicitly. Iterating over each byte using mmap consumes more memory than file.read(1), but mmap is an order of magnitude faster.
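
    A small illustrative sketch of that slice access (my addition, not part of the original answer); it assumes the file at filename is at least 116 bytes long:

    from mmap import ACCESS_READ, mmap

    with open(filename, 'rb', 0) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as s:
        header = s[:4]        # bytes: the first four bytes of the file
        window = s[100:116]   # 16 bytes starting at offset 100, fetched on demand
        print(header, window)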

  • 2020-11-22 01:16

    If you have a lot of binary data to read, you might want to consider the struct module. It is documented as converting "between C and Python types", but of course, bytes are bytes, and whether they were created as C types does not matter. For example, if your binary data contains two 2-byte integers and one 4-byte integer, you can read them as follows (adapted from the struct documentation; the '>' prefix pins the byte order and sizes so the example behaves the same on every machine):

    >>> import struct
    >>> struct.unpack('>hhl', b'\x00\x01\x00\x02\x00\x00\x00\x03')
    (1, 2, 3)
    

    You might find this more convenient, faster, or both, than explicitly looping over the content of a file.
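
    For instance, a hedged sketch of pulling fixed-size records straight out of a file with struct; the '<hhl' layout and the file name data.bin are illustrative assumptions:

    import struct

    record = struct.Struct('<hhl')        # two 2-byte ints + one 4-byte int, little-endian

    with open('data.bin', 'rb') as f:     # 'data.bin' is a made-up name
        while True:
            raw = f.read(record.size)     # record.size == 8 for this layout
            if len(raw) < record.size:    # EOF (or a trailing partial record)
                break
            a, b, c = record.unpack(raw)
            # do something with the three integers a, b, c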

  • 2020-11-22 01:20

    To sum up all the brilliant points of chrispy, Skurmedel, Ben Hoyt and Peter Hansen, this would be the optimal solution for processing a binary file one byte at a time:

    with open("myfile", "rb") as f:
        while True:
            byte = f.read(1)
            if not byte:
                break
            do_stuff_with(ord(byte))
    

    This works for Python 2.6 and above, because:

    • Python buffers internally, so there is no need to read in chunks yourself
    • DRY principle: the read line is not repeated
    • the with statement ensures a clean file close
    • byte evaluates to false when there are no more bytes, not when a byte is zero (a one-byte string is always truthy)

    Or use J. F. Sebastian's solution for improved speed:

    from functools import partial
    
    with open(filename, 'rb') as file:
        for byte in iter(partial(file.read, 1), b''):
            ...  # do stuff with byte
    

    Or if you want it as a generator function, as demonstrated by codeape:

    def bytes_from_file(filename):
        with open(filename, "rb") as f:
            while True:
                byte = f.read(1)
                if not byte:
                    break
                yield ord(byte)
    
    # example:
    for b in bytes_from_file('filename'):
        do_stuff_with(b)
    
  • 2020-11-22 01:21

    If the file is not so big that holding it in memory is a problem:

    with open("filename", "rb") as f:
        bytes_read = f.read()
    for b in bytes_read:
        process_byte(b)
    

    where process_byte represents some operation you want to perform on the passed-in byte.

    If you want to process a chunk at a time:

    with open("filename", "rb") as f:
        bytes_read = f.read(CHUNKSIZE)
        while bytes_read:
            for b in bytes_read:
                process_byte(b)
            bytes_read = f.read(CHUNKSIZE)
    

    The with statement is available in Python 2.5 and later (in 2.5 it requires from __future__ import with_statement; from 2.6 on it is always available).

  • 2020-11-22 01:22

    Reading binary file in Python and looping over each byte

    The pathlib module (new in Python 3.4) gained a convenience method in Python 3.5, Path.read_bytes, which reads a file in as bytes, allowing us to iterate over the bytes. I consider this a decent (if quick and dirty) answer:

    import pathlib
    
    for byte in pathlib.Path(path).read_bytes():
        print(byte)
    

    Interesting that this is the only answer to mention pathlib.

    In Python 2, you probably would do this (as Vinay Sajip also suggests):

    with open(path, 'rb') as file:
        for byte in file.read():
            print(byte)
    

    If the file may be too large to hold in memory, chunk it idiomatically using the iter function with the callable, sentinel signature. Here is the Python 2 version (a Python 3 sketch follows after the snippet):

    with open(path, 'rb') as file:
        callable = lambda: file.read(1024)
        sentinel = bytes() # or b''
        for chunk in iter(callable, sentinel): 
            for byte in chunk:
                print(byte)
    

    (Several other answers mention this, but few offer a sensible read size.)
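
    A minimal Python 3 sketch of the same callable/sentinel pattern; the differences are the 'rb' mode and that iterating over a bytes chunk yields ints, so print shows numbers rather than characters:

    with open(path, 'rb') as file:
        for chunk in iter(lambda: file.read(1024), b''):
            for byte in chunk:     # byte is an int in Python 3
                print(byte)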

    Best practice for large files or buffered/interactive reading

    Let's create a function to do this, including idiomatic uses of the standard library for Python 3.5+:

    from pathlib import Path
    from functools import partial
    from io import DEFAULT_BUFFER_SIZE
    
    def file_byte_iterator(path):
        """given a path, return an iterator over the file
        that lazily loads the file
        """
        path = Path(path)
        with path.open('rb') as file:
            reader = partial(file.read1, DEFAULT_BUFFER_SIZE)
            file_iterator = iter(reader, bytes())
            for chunk in file_iterator:
                yield from chunk
    

    Note that we use file.read1. file.read blocks until it has all the bytes requested of it or hits end of file; file.read1 does at most one read on the underlying stream, so it can return more quickly (possibly with fewer bytes). No other answer mentions this.
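
    A small illustrative sketch of the difference (my example, with a made-up Trickle raw stream that returns at most 3 bytes per call): read1 performs a single read on the raw stream, while read keeps calling it until it has everything requested or hits EOF:

    import io

    class Trickle(io.RawIOBase):
        """A toy raw stream that hands out at most 3 bytes per read call."""
        def __init__(self, data):
            self._data = io.BytesIO(data)
        def readable(self):
            return True
        def readinto(self, b):
            chunk = self._data.read(min(len(b), 3))
            b[:len(chunk)] = chunk
            return len(chunk)

    buffered = io.BufferedReader(Trickle(b'abcdefgh'))
    print(buffered.read1(100))   # one raw read only: b'abc'
    print(buffered.read(100))    # keeps reading until EOF: b'defgh'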

    Demonstration of best practice usage:

    Let's make a file with a megabyte (actually mebibyte) of pseudorandom data:

    import random
    import pathlib
    path = 'pseudorandom_bytes'
    pathobj = pathlib.Path(path)
    
    pathobj.write_bytes(
      bytes(random.randint(0, 255) for _ in range(2**20)))
    

    Now let's iterate over it and materialize it in memory:

    >>> l = list(file_byte_iterator(path))
    >>> len(l)
    1048576
    

    We can inspect any part of the data, for example, the last 100 and first 100 bytes:

    >>> l[-100:]
    [208, 5, 156, 186, 58, 107, 24, 12, 75, 15, 1, 252, 216, 183, 235, 6, 136, 50, 222, 218, 7, 65, 234, 129, 240, 195, 165, 215, 245, 201, 222, 95, 87, 71, 232, 235, 36, 224, 190, 185, 12, 40, 131, 54, 79, 93, 210, 6, 154, 184, 82, 222, 80, 141, 117, 110, 254, 82, 29, 166, 91, 42, 232, 72, 231, 235, 33, 180, 238, 29, 61, 250, 38, 86, 120, 38, 49, 141, 17, 190, 191, 107, 95, 223, 222, 162, 116, 153, 232, 85, 100, 97, 41, 61, 219, 233, 237, 55, 246, 181]
    >>> l[:100]
    [28, 172, 79, 126, 36, 99, 103, 191, 146, 225, 24, 48, 113, 187, 48, 185, 31, 142, 216, 187, 27, 146, 215, 61, 111, 218, 171, 4, 160, 250, 110, 51, 128, 106, 3, 10, 116, 123, 128, 31, 73, 152, 58, 49, 184, 223, 17, 176, 166, 195, 6, 35, 206, 206, 39, 231, 89, 249, 21, 112, 168, 4, 88, 169, 215, 132, 255, 168, 129, 127, 60, 252, 244, 160, 80, 155, 246, 147, 234, 227, 157, 137, 101, 84, 115, 103, 77, 44, 84, 134, 140, 77, 224, 176, 242, 254, 171, 115, 193, 29]
    

    Don't iterate by lines for binary files

    Don't do the following; it pulls a chunk of arbitrary size until it reaches a newline character, which is too slow when the chunks are small and possibly wasteful when they are large:

        with open(path, 'rb') as file:
            for chunk in file: # text newline iteration - not for bytes
                yield from chunk
    

    The above is only appropriate for semantically human-readable text files (plain text, code, markup, Markdown, and so on; essentially anything ASCII-, UTF-, or Latin-encoded), which you should open without the 'b' flag.
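
    For contrast, a minimal sketch of that text case, assuming a UTF-8 encoded file at path, where plain line iteration is the idiomatic loop:

    with open(path, encoding='utf-8') as file:   # text mode, no 'b'
        for line in file:
            print(line.rstrip('\n'))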

  • 2020-11-22 01:26

    This generator yields bytes from a file, reading the file in chunks:

    def bytes_from_file(filename, chunksize=8192):
        with open(filename, "rb") as f:
            while True:
                chunk = f.read(chunksize)
                if chunk:
                    for b in chunk:
                        yield b
                else:
                    break
    
    # example:
    for b in bytes_from_file('filename'):
        do_stuff_with(b)
    

    See the Python documentation for information on iterators and generators.
