Is there a faster way (than this) to calculate the hash of a file (using hashlib) in Python?

Submitted by 醉酒当歌 on 2019-12-05 22:27:52

Using an 874 MiB random data file, which the openssl md5 tool hashed in 2 seconds, I was able to improve speed as follows.

  • Using your first method required 21 seconds.
  • Reading the entire file (21 seconds) into a buffer and then updating required 2 seconds.
  • Using the following function with a buffer size of 8096 required 17 seconds.
  • Using the following function with a buffer size of 32767 required 11 seconds.
  • Using the following function with a buffer size of 65536 required 8 seconds.
  • Using the following function with a buffer size of 131072 required 8 seconds.
  • Using the following function with a buffer size of 1048576 required 12 seconds.

    import hashlib
    import time

    def md5_speedcheck(path, size):
        pts = time.process_time()
        ats = time.time()
        m = hashlib.md5()
        # Read and hash the file in `size`-byte chunks.
        with open(path, 'rb') as f:
            b = f.read(size)
            while len(b) > 0:
                m.update(b)
                b = f.read(size)
        print("{0:.3f} s".format(time.process_time() - pts))
        print("{0:.3f} s".format(time.time() - ats))
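A quick way to sanity-check the chunked approach is to confirm it produces the same digest as hashing the entire contents in one call. This is a minimal sketch (the name `md5_chunked` and the temporary test file are illustrative, not from the original answer):

```python
import hashlib
import os
import tempfile

def md5_chunked(path, size=65536):
    # Hash the file in `size`-byte chunks so the whole file
    # never has to fit in memory at once.
    m = hashlib.md5()
    with open(path, 'rb') as f:
        b = f.read(size)
        while b:
            m.update(b)
            b = f.read(size)
    return m.hexdigest()

# Write a small random file and check that the chunked digest
# matches hashing the entire contents with a single update.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(1_000_000))
    path = tmp.name

assert md5_chunked(path) == hashlib.md5(open(path, 'rb').read()).hexdigest()
os.unlink(path)
```

The same pattern works for any `hashlib` algorithm; only the constructor changes.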

Wall-clock time is what I noted above, whereas processor time for all of these is about the same, with the difference being spent in IO blocking.

The key determinant here is to have a buffer size that is big enough to mitigate disk latency, but small enough to avoid VM page swaps. For my particular machine it appears that 64 KiB is about optimal.
