I am trying to compute the MD5 hash of a file with the hashlib.md5() function from the hashlib module. So I wrote this piece of code:
Big (≈2**40) chunk sizes lead to MemoryError, i.e., there is no limit other than available RAM. On the other hand, bufsize is limited by 2**31-1 on my machine:
import hashlib
from functools import partial

def md5(filename, chunksize=2**15, bufsize=-1):
    m = hashlib.md5()
    # read the file in chunksize-byte pieces until f.read() returns b'' at EOF
    with open(filename, 'rb', bufsize) as f:
        for chunk in iter(partial(f.read, chunksize), b''):
            m.update(chunk)
    return m
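The function returns the hash object, so you can call hexdigest() on the result; the file name below is just a placeholder:

print(md5('example.bin').hexdigest())  # 'example.bin' is a hypothetical file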
A big chunksize can be as slow as a very small one. Measure it. I find that for ≈10 MB files, a chunksize of 2**15 is the fastest for the files I've tested.
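To measure it yourself, a rough timing sketch along these lines can help; it reuses the md5() function defined above, the file name is a placeholder, and the candidate chunk sizes are arbitrary assumptions:

import time

for exp in (10, 12, 15, 18, 20):
    start = time.perf_counter()
    md5('testfile.bin', chunksize=2**exp)  # 'testfile.bin' is a placeholder
    print(f"chunksize=2**{exp}: {time.perf_counter() - start:.3f}s")

Run it a few times on files of the sizes you actually care about; the best value depends on the machine and the file.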
The buffer value is the number of bytes that are read and stored in memory at once, so yes, the only limit is your available memory.
However, bigger values are not automatically faster. At some point, you might run into memory paging issues or other slowdowns with memory allocation if the buffer is too large. You should experiment with larger and larger values until you hit diminishing returns in speed.
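As a sketch of that experiment, the read loop can be written with an explicit read size that you vary between runs; the size and file name here are assumptions, not recommendations:

import hashlib

readsize = 2**20  # 1 MiB per read; an arbitrary starting point to tune
m = hashlib.md5()
with open('example.bin', 'rb') as f:  # 'example.bin' is a placeholder
    while True:
        data = f.read(readsize)
        if not data:
            break
        m.update(data)
print(m.hexdigest())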
To be able to handle arbitrarily large files you need to read them in blocks. The size of such blocks should preferably be a power of 2, and in the case of MD5 the minimum possible block consists of 64 bytes (512 bits), since 512-bit blocks are the units on which the algorithm operates.
But if we go beyond that and try to establish an exact criterion for whether, say, a 2048-byte block is better than a 4096-byte block... we will likely fail. This needs to be carefully tested and measured, and in practice the value is almost always chosen based on experience.
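For what it's worth, hashlib exposes the algorithm's internal block size, so one option is to pick a read size that is a power-of-2 multiple of it; the multiplier and file name below are arbitrary assumptions:

import hashlib

m = hashlib.md5()
print(m.block_size)  # 64 bytes (512 bits) for MD5

# read in chunks that are a power-of-2 multiple of the internal block size
chunksize = m.block_size * 2**10  # 64 KiB; the multiplier is arbitrary
with open('example.bin', 'rb') as f:  # 'example.bin' is a placeholder
    for chunk in iter(lambda: f.read(chunksize), b''):
        m.update(chunk)
print(m.hexdigest())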