What is the fastest hash algorithm to check if two files are equal?

野性不改 2020-12-07 10:15

What is the fastest way to create a hash function which will be used to check if two files are equal?

Security is not very important.

Edit: I am sending a fi

12 Answers
  • 2020-12-07 10:53

    One approach might be to use a simple CRC-32 algorithm and, only if the CRC values compare equal, rerun the hash with SHA-1 or something more robust. A fast CRC-32 will outperform a cryptographically secure hash any day.
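
    A minimal sketch of that two-stage check, assuming Python's standard `zlib.crc32` and `hashlib.sha1`. The function names and the 1 MiB read size are illustrative choices, not part of the answer:

```python
import hashlib
import zlib

CHUNK = 1 << 20  # read 1 MiB at a time (arbitrary choice)

def crc32_of(path):
    """Stream the file through zlib.crc32 and return the 32-bit checksum."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            crc = zlib.crc32(chunk, crc)  # feed the running CRC back in
    return crc

def sha1_of(path):
    """Stream the file through SHA-1 and return the hex digest."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            h.update(chunk)
    return h.hexdigest()

def probably_equal(path_a, path_b):
    """Cheap CRC-32 first; run the slower SHA-1 only when the CRCs match."""
    if crc32_of(path_a) != crc32_of(path_b):
        return False  # different CRCs prove the files differ
    return sha1_of(path_a) == sha1_of(path_b)
```

    Note that when the files really are equal, this reads each of them twice, so the saving only shows up when most pairs differ.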

  • 2020-12-07 10:54

    If it's only a one-off then, given that you'll have to read both files to generate a hash of each of them anyway, why not just read through a small amount of each at a time and compare?

    Failing that, CRC is a very simple algorithm.
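
    Reading a small amount of each file at a time could look roughly like this. The function name and the 64 KiB chunk size are assumptions for illustration:

```python
def same_content(path_a, path_b, chunk_size=64 * 1024):
    """Compare two files chunk by chunk, stopping at the first difference."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False  # differing bytes, or one file ended early
            if not a:
                return True   # both files hit end-of-file together
```

    Python's standard library offers much the same thing as `filecmp.cmp(path_a, path_b, shallow=False)`.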

  • 2020-12-07 10:57

    What we are optimizing here is time spent on a task. Unfortunately we do not know enough about the task at hand to know what the optimal solution should be.

    Is it for a one-time comparison of 2 arbitrary files? Then compare size, and after that simply compare the files, byte by byte (or MB by MB) if that's better for your IO.

    If it is for 2 large sets of files, or many sets of files, and it is not a one-time exercise but something that will happen frequently, then one should store hashes for each file. A hash is never unique, but a 32-bit hash (a number of up to 10 decimal digits) can take about 4.3 billion distinct values, and a 64-bit hash about 1.8 * 10^19 (roughly 18 quintillion).
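
    To put rough numbers on how far 32 or 64 bits go, here is a small sketch using the standard birthday-bound approximation. It assumes the hash values are uniformly distributed, and the one-million-file figure is just an example:

```python
import math

def collision_probability(n_files, bits):
    """Approximate chance that at least two of n_files share a hash value."""
    space = 2 ** bits
    x = n_files * (n_files - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny
    return -math.expm1(-x)

for bits in (32, 64):
    print(bits, "bits:", 2 ** bits, "values,",
          "collision chance for 1,000,000 files:",
          collision_probability(1_000_000, bits))
```

    Under that assumption, a million files are practically certain to contain at least one 32-bit collision, while for 64 bits the chance is on the order of 10^-8.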

    A decent compromise would be to generate 2 32-bit hashes for each file, one for the first 8k, another for 1MB+8k, and slap them together as a single 64-bit number. Cataloging all existing files into a DB should be fairly quick, and looking up a candidate file against this DB should also be very quick. Once there is a match, the only way to determine if they are the same is to compare the whole files.
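
    A sketch of that scheme, with several assumptions labeled: "1MB+8k" is read here as the 8 KiB block starting at the 1 MiB offset, CRC-32 stands in for the unspecified 32-bit hash, and an in-memory dict stands in for the DB. All names are hypothetical:

```python
import filecmp
import zlib
from collections import defaultdict

BLOCK = 8 * 1024             # 8 KiB sample size
SECOND_OFFSET = 1024 * 1024  # assumed start of the second sample (1 MiB)

def fingerprint(path):
    """Join two 32-bit CRCs of sampled blocks into one 64-bit number."""
    with open(path, "rb") as f:
        head = zlib.crc32(f.read(BLOCK))
        f.seek(SECOND_OFFSET)
        # Files shorter than 1 MiB give an empty read here, whose CRC is 0.
        mid = zlib.crc32(f.read(BLOCK))
    return (head << 32) | mid

def build_catalog(paths):
    """Map each fingerprint to the list of files that produced it."""
    catalog = defaultdict(list)
    for p in paths:
        catalog[fingerprint(p)].append(p)
    return catalog

def find_duplicates(candidate, catalog):
    """Look up by fingerprint, then confirm with a full byte comparison."""
    return [p for p in catalog.get(fingerprint(candidate), [])
            if filecmp.cmp(candidate, p, shallow=False)]
```

    Because only two small blocks are sampled, files that differ elsewhere share a fingerprint, which is why the full comparison at the end is not optional.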

    I am a believer in giving people what they need, which is not always what they think they need, or what they want.

  • 2020-12-07 10:57

    You might check out the algorithm that the Samba/rsync developers use. I haven't looked at it in depth, but I see it mentioned all the time. Apparently it's quite good.

  • 2020-12-07 10:57

    I remember the old modem transfer protocols, like Zmodem, would do some sort of CRC compare for each block as it was sent. CRC32, if I remember ancient history well enough. I'm not suggesting you make your own transfer protocol, unless that's exactly what you're doing, but you could maybe have it spot check a block of the file periodically, or maybe doing hashes of each 8k block would be simple enough for the processors to handle. Haven't tried it, myself.
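
    The per-block idea could be sketched like this, assuming CRC-32 from Python's `zlib` and 8 KiB blocks. The helper names are made up for illustration, and this is not how Zmodem itself is implemented:

```python
import zlib

BLOCK = 8 * 1024  # 8 KiB blocks, as suggested above

def block_crcs(path, block_size=BLOCK):
    """Return one CRC-32 per block of the file, in order."""
    crcs = []
    with open(path, "rb") as f:
        while block := f.read(block_size):
            crcs.append(zlib.crc32(block))
    return crcs

def first_mismatch(crcs_a, crcs_b):
    """Index of the first block whose CRCs differ, or None if all match."""
    for i, (a, b) in enumerate(zip(crcs_a, crcs_b)):
        if a != b:
            return i
    if len(crcs_a) != len(crcs_b):
        return min(len(crcs_a), len(crcs_b))  # one file has extra blocks
    return None
```

    With the sender and receiver each computing `block_crcs` locally, only the short CRC lists need to be exchanged, and a mismatch points at the block to resend.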

  • 2020-12-07 10:59

    xxHash claims to be quite fast and strong, collision-wise:

    http://cyan4973.github.io/xxHash/

    There is a 64-bit variant that runs "even faster" than the 32-bit one on 64-bit processors, though slower on 32-bit processors (go figure).

    http://code.google.com/p/crcutil is also said to be quite fast (and leverages hardware CRC instructions where present, which are probably very fast, but if you don't have hardware that supports them, aren't as fast). Don't know if CRC32c is as good of a hash (in terms of collisions) as xxHash or not...

    https://code.google.com/p/cityhash/ seems similar and related to crcutil [in that it can compile down to use hardware CRC32c instructions if instructed].

    If you "just want the fastest raw speed" and don't care as much about quality of random distribution of the hash output (for instance, with small sets, or where speed is paramount), there are some fast algorithms mentioned here: http://www.sanmayce.com/Fastest_Hash/ (these "not quite random" distribution type algorithms are, in some cases, "good enough" and very fast). Apparently FNV1A_Jesteress is the fastest for "long" strings, some others possibly for small strings. http://locklessinc.com/articles/fast_hash/ also seems related. I did not research to see what the collision properties of these are.
