Reading binary file and looping over each byte

孤街浪徒 2020-11-22 00:53

In Python, how do I read in a binary file and loop over each byte of that file?

12 answers
  •  孤街浪徒
    2020-11-22 01:13

    To read a file one byte at a time (ignoring buffering), you could use the two-argument iter(callable, sentinel) built-in function:

    with open(filename, 'rb') as file:
        for byte in iter(lambda: file.read(1), b''):
            ...  # do stuff with byte (a bytes object of length 1)
    

    This calls file.read(1) repeatedly until it returns b'' (an empty bytestring), which signals end of file. Memory use stays bounded even for large files. You could pass buffering=0 to open() to disable buffering; that guarantees only one byte is read from the file per iteration, which is slow.

    The with statement closes the file automatically, including when the code inside it raises an exception.
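
    To make the effect of buffering=0 concrete, here is a minimal sketch. The temporary file and the names demo_path, buffered, raw and values are made up for illustration. It shows that open() returns a different object type when buffering is disabled, and that each item produced by file.read(1) is a bytes object of length 1, so indexing it gives the integer value:

```python
import os
import tempfile

# Create a small throwaway file for the demonstration (hypothetical data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\x01\x02\xff')
    demo_path = tmp.name

try:
    with open(demo_path, 'rb') as buffered, \
            open(demo_path, 'rb', buffering=0) as raw:
        buffered_type = type(buffered).__name__  # default: io.BufferedReader
        raw_type = type(raw).__name__            # buffering=0: io.FileIO
        # Each item is a bytes object of length 1; byte[0] is its int value.
        values = [byte[0] for byte in iter(lambda: raw.read(1), b'')]
finally:
    os.remove(demo_path)

print(buffered_type, raw_type, values)
# BufferedReader FileIO [1, 2, 255]
```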

    Despite the presence of internal buffering by default, it is still inefficient to process one byte at a time. For example, here's the blackhole.py utility that eats everything it is given:

    #!/usr/bin/env python3
    """Discard all input. `cat > /dev/null` analog."""
    import sys
    from functools import partial
    from collections import deque
    
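    # Chunk size in bytes; defaults to 32 KiB if no argument is given.
    chunksize = int(sys.argv[1]) if len(sys.argv) > 1 else (1 << 15)
    # detach() gives the underlying binary stream; deque(maxlen=0) consumes
    # the iterator while storing nothing.
    deque(iter(partial(sys.stdin.detach().read, chunksize), b''), maxlen=0)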
    

    Example:

    $ dd if=/dev/zero bs=1M count=1000 | python3 blackhole.py
    

    On my machine it processes ~1.5 GB/s when chunksize == 32768 and only ~7.5 MB/s when chunksize == 1. That is, reading one byte at a time is about 200 times slower. Keep this in mind if you need performance and can rewrite your processing to handle more than one byte at a time.
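
    If the processing really is per byte, one way to keep most of that speed is to read in large chunks and loop over each chunk in memory. The sketch below is one possible approach, not part of the benchmark above; iter_bytes and the demo file are hypothetical names introduced for illustration:

```python
import os
import tempfile
from functools import partial


def iter_bytes(path, chunksize=1 << 15):
    """Yield every byte of the file as an int, reading chunksize bytes per call."""
    with open(path, 'rb') as f:
        for chunk in iter(partial(f.read, chunksize), b''):
            # Iterating over a bytes object yields ints in Python 3.
            yield from chunk


# Small demonstration on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\x00\x01\xfe\xff')
    demo_path = tmp.name

try:
    # chunksize=3 forces two reads so the chunk boundary is exercised.
    demo_values = list(iter_bytes(demo_path, chunksize=3))
finally:
    os.remove(demo_path)

print(demo_values)
# [0, 1, 254, 255]
```

    Note that this yields integers rather than length-1 bytes objects, unlike the file.read(1) loop.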

    mmap allows you to treat a file as a bytearray and a file object simultaneously. It can serve as an alternative to loading the whole file into memory if you need access to both interfaces. In particular, you can iterate one byte at a time over a memory-mapped file using just a plain for-loop:

    from mmap import ACCESS_READ, mmap
    
    with open(filename, 'rb', 0) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mm:
        for byte in mm:  # length is equal to the current file size
            ...  # do stuff with byte
    

    mmap supports slice notation. For example, mm[i:i+n] returns n bytes from the file starting at position i. The context manager protocol is not supported before Python 3.2; in that case you need to call mm.close() explicitly. Iterating over each byte using mmap consumes more memory than file.read(1), but mmap is an order of magnitude faster.
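
    As a rough illustration of the slicing described above, the following sketch maps a small throwaway file and reads from it by slice, by index and with find(). The file contents and the names demo_path, chunk, first, size and pos are assumptions made for the example:

```python
import mmap
import os
import tempfile

# Create a small throwaway file for the demonstration (hypothetical data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'hello binary world')
    demo_path = tmp.name

try:
    with open(demo_path, 'rb') as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        i, n = 6, 6
        chunk = mm[i:i + n]       # slicing returns a bytes object
        first = mm[0]             # integer indexing returns an int in Python 3
        size = len(mm)            # length of the mapping, here the file size
        pos = mm.find(b'world')   # search without loading the file yourself
finally:
    os.remove(demo_path)

print(chunk, first, size, pos)
# b'binary' 104 18 13
```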
