Robust and fast checksum algorithm?

失恋的感觉 2020-12-23 20:30

Which checksum algorithm can you recommend in the following use case?

I want to generate checksums of small JPEG files (~8 kB each) to check if the content changed.

10 Answers
  • 2020-12-23 20:58

    adler32, available in the zlib headers, is advertised as being significantly faster than crc32, while being only slightly less accurate.
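    If you are working in Python, both checksums are exposed by the standard-library `zlib` module, so a quick comparison is easy to sketch (the sample bytes below are just a stand-in for an ~8 kB JPEG):

    ```python
    import zlib

    # Placeholder data: a JPEG SOI marker followed by padding,
    # standing in for a real ~8 kB file read with open(path, "rb").
    data = b"\xff\xd8\xff\xe0" + b"\x00" * 8192

    # Mask to 32 bits so the result is consistent across platforms.
    a = zlib.adler32(data) & 0xFFFFFFFF
    c = zlib.crc32(data) & 0xFFFFFFFF

    print(f"adler32: {a:08x}")
    print(f"crc32:   {c:08x}")
    ```

    One caveat worth knowing: Adler-32's speed advantage comes at the cost of noticeably weaker mixing on short inputs, so for small files the gap in "accuracy" is larger than the advertising suggests.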

  • 2020-12-23 20:58

    Your most important requirement is "to check if the content changed".

    If it is most important that ANY change in the file be detected, MD5, SHA-1 or even SHA-256 should be your choice.

    Given that you indicated the checksum need NOT be cryptographically strong, I would recommend CRC-32 for three reasons. CRC-32 gives good Hamming distances over an 8 kB file. CRC-32 will be at least an order of magnitude faster than MD5 to calculate (your second requirement). Sometimes just as important, CRC-32 only requires 32 bits to store the value to be compared; MD5 requires 4 times the storage and SHA-1 requires 5 times.

    BTW, any technique will be strengthened by prepending the length of the file when calculating the hash.
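    The length-prepending trick can be sketched in a few lines of Python: feed the length into the CRC first, then continue the same running CRC over the content (the helper name and the 8-byte big-endian length encoding are illustrative choices, not part of any standard):

    ```python
    import zlib

    def checksum_with_length(data: bytes) -> int:
        # Mix the length into the checksum first, then continue the
        # same running CRC over the content itself. Two files whose
        # contents collide under plain CRC-32 but differ in length
        # will now produce different values.
        crc = zlib.crc32(len(data).to_bytes(8, "big"))
        return zlib.crc32(data, crc) & 0xFFFFFFFF

    jpeg_bytes = b"\xff\xd8" + b"\x00" * 100  # placeholder content
    print(f"{checksum_with_length(jpeg_bytes):08x}")
    ```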

  • 2020-12-23 21:02

    There are lots of fast CRC algorithms that should do the trick: http://www.google.com/search?hl=en&q=fast+crc&aq=f&oq=

    Edit: Why the hate? CRC is totally appropriate, as evidenced by the other answers. A Google search was also appropriate, since no language was specified. This is an old, old problem which has been solved so many times that there isn't likely to be a definitive answer.

  • 2020-12-23 21:03

    CRC32 is probably good enough, although there's a small chance you might get a collision, such that a file that has been modified might look like it hasn't been because the two versions generate the same checksum. To avoid this possibility I'd therefore suggest using MD5, which will easily be fast enough, and the chances of a collision occurring are reduced to the point where they're almost infinitesimal.
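    In Python this is a handful of lines with the standard-library `hashlib` module; reading in chunks keeps memory flat even if some files turn out larger than 8 kB (the function name here is just an illustrative choice):

    ```python
    import hashlib

    def md5_of_file(path: str) -> str:
        # Stream the file in 8 kB chunks rather than reading it whole,
        # so memory use stays constant regardless of file size.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()
    ```

    Storing the 32-character hex digest per file and comparing it on the next scan tells you whether the content changed.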

    As others have said, with lots of small files your real performance bottleneck is going to be I/O so the issue is dealing with that. If you post up a few more details somebody will probably suggest a way of sorting that out as well.
