Max limit of bytes in the update() method of Python's hashlib module

误落风尘 2020-12-21 09:32

I am trying to compute the MD5 hash of a file with the hashlib.md5() function from the hashlib module.

So I wrote this piece of code:
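A minimal sketch of that approach (hypothetical, since the original snippet is not shown; 'somefile.bin' stands in for the real file) would be:

    import hashlib

    # Hypothetical sketch, not the original snippet: the whole file is read
    # into memory and passed to update() in a single call.
    m = hashlib.md5()
    with open('somefile.bin', 'rb') as f:   # 'somefile.bin' is a placeholder
        m.update(f.read())
    print(m.hexdigest())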



        
3 Answers
  • 2020-12-21 09:40

    Big (≈2**40) chunk sizes lead to MemoryError, i.e., there is no limit other than available RAM. On the other hand, bufsize is limited to 2**31-1 on my machine:

    import hashlib
    from functools import partial

    def md5(filename, chunksize=2**15, bufsize=-1):
        # Read the file in chunksize-byte pieces, feeding each one to update();
        # bufsize is passed straight to open() as its buffering argument.
        m = hashlib.md5()
        with open(filename, 'rb', bufsize) as f:
            for chunk in iter(partial(f.read, chunksize), b''):
                m.update(chunk)
        return m
    

    A big chunksize can be as slow as a very small one. Measure it.

    I find that a 2**15 chunksize is the fastest for the ≈10 MB files I've tested.
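    For instance, a minimal timing sketch (assuming the md5() function above; 'data.bin' is a hypothetical test file) could be:

    import time

    # Compare a few chunk sizes with the md5() helper defined above;
    # 'data.bin' is a hypothetical test file.
    for chunksize in (2**12, 2**15, 2**20):
        start = time.perf_counter()
        md5('data.bin', chunksize=chunksize)
        print(chunksize, time.perf_counter() - start)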

  • 2020-12-21 09:55

    The buffer value is the number of bytes that are read and stored in memory at once, so yes, the only limit is your available memory.

    However, bigger values are not automatically faster. At some point, you might run into memory paging issues or other slowdowns with memory allocation if the buffer is too large. You should experiment with larger and larger values until you hit diminishing returns in speed.
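    As a rough illustration (a sketch, not part of the original answer; the helper name and the 1 MiB default are arbitrary), a fixed-size reusable buffer caps how many bytes of file data sit in memory at once:

    import hashlib

    def md5_with_buffer(filename, bufsize=2**20):
        # Hypothetical helper: reuse one bufsize-byte buffer so that at most
        # bufsize bytes of file data are held in memory at any moment.
        h = hashlib.md5()
        buf = bytearray(bufsize)
        view = memoryview(buf)
        with open(filename, 'rb') as f:
            while True:
                n = f.readinto(buf)
                if not n:
                    break
                h.update(view[:n])
        return h.hexdigest()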

  • 2020-12-21 09:56

    To handle arbitrarily large files you need to read them in blocks. The block size should preferably be a power of 2, and in the case of MD5 the minimum sensible block is 64 bytes (512 bits), since 512-bit blocks are the units on which the algorithm operates.

    But if we go beyond that and try to establish an exact criterion for whether, say, a 2048-byte block is better than a 4096-byte one... we will likely fail. This needs to be carefully tested and measured, and in practice the value is almost always chosen from experience.
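    As a small sanity check (a sketch added here, not part of the original answer), feeding the data in 64-byte pieces yields exactly the same digest as hashing it in one call:

    import hashlib

    data = b'spam' * 1000                      # arbitrary example data
    one_shot = hashlib.md5(data).hexdigest()

    chunked = hashlib.md5()
    for i in range(0, len(data), 64):          # 64-byte (512-bit) pieces
        chunked.update(data[i:i + 64])

    assert chunked.hexdigest() == one_shot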
