I'm essentially preparing phrases to be put into the database; they may be malformed, so I want to store a short hash of them instead (I will be simply comparing if they exi
2019 update: This answer is the most up to date. Libraries supporting Murmur are widely available for nearly all languages.
The current recommendation is to use the Murmur Hash Family (see specifically the murmur2 or murmur3 variants).
Murmur hashes were designed for fast hashing with minimal collisions (much faster than CRC, MDx and SHAx). They are well suited to finding duplicates and very appropriate for hash table indexes.
In fact, Murmur is used by many modern databases (Redis, Elasticsearch, Cassandra) to compute all sorts of hashes for various purposes. This specific algorithm was the root source of many performance improvements in the current decade.
It's also used in implementations of Bloom Filters. You should be aware that if you're searching for "fast hashes", you're probably facing a typical problem that is solved by Bloom filters. ;-)
Note: Murmur is a general-purpose hash, meaning it is NOT cryptographic. It does not prevent someone from finding a source "text" that produces a given hash. It's NOT appropriate for hashing passwords.
Some more details: MurmurHash - what is it?
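For PHP specifically, here is a minimal sketch of the duplicate check, assuming PHP 8.1 or later, where the built-in hash() function lists murmur3a among its algorithms. The helper name phrase_key() and the fallback to crc32b on older versions are assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns a short, non-cryptographic key for a phrase.
function phrase_key(string $phrase): string
{
    // murmur3a (32-bit MurmurHash3) appears in hash_algos() from PHP 8.1 onwards;
    // crc32b is used here only as an illustrative fallback for older versions.
    $algo = in_array('murmur3a', hash_algos(), true) ? 'murmur3a' : 'crc32b';
    return hash($algo, $phrase); // 8 hex characters for either algorithm
}

$seen = [];
foreach (['ana are mere', 'Ana are mere', 'ana are mere'] as $phrase) {
    $key = phrase_key($phrase);
    if (isset($seen[$key])) {
        // A 32-bit key can collide, so a match means "duplicate or collision".
        echo "already seen (or collision): {$phrase}\n";
    } else {
        $seen[$key] = true;
    }
}
```

In a database the same key would typically go into an indexed column, with the full phrase compared on a match if false positives matter.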
Instead of assuming that MD5 is "fairly slow", try it. A simple C-based implementation of MD5 on a simple PC (mine, a 2.4 GHz Core2, using a single core) can hash 6 million small messages per second. A small message here is anything up to 55 bytes. For longer messages, MD5 hashing speed is linear in the message size, i.e. it crunches data at about 400 megabytes per second. You may note that this is four times the maximum speed of a good hard disk or a gigabit Ethernet network card.
Since my PC has four cores, this means that hashing data as fast as my hard disk can provide or receive it uses at most 6% of the available computing power. It takes a very special situation for hashing speed to become a bottleneck or even to induce a noticeable cost on a PC.
On much smaller architectures where hashing speed may become somewhat relevant, you may want to use MD4. MD4 is fine for non-cryptographic purposes (and for cryptographic purposes, you should not be using MD5 anyway). It has been reported that MD4 is even faster than CRC32 on ARM-based platforms.
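To "try it" from PHP rather than C, a rough sketch along these lines could be used. The helper name hashes_per_second() and the iteration count are assumptions, and the loop includes interpreter overhead, so the numbers will be lower than the C figures quoted above.

```php
<?php
// Hypothetical helper: measures how many messages per second hash() processes.
function hashes_per_second(string $algo, string $message, int $iterations = 200000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        hash($algo, $message);
    }
    $elapsed = microtime(true) - $start;
    return $iterations / max($elapsed, 1e-9); // guard against a zero interval
}

$small = str_repeat('x', 55); // a "small message" as defined above: up to 55 bytes
foreach (['md5', 'md4'] as $algo) {
    printf("%s: %.0f small messages per second\n", $algo, hashes_per_second($algo, $small));
}
```

Running it a few times and on the real target hardware gives a better picture than a single measurement.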
Step One: Install libsodium (or make sure you're using PHP 7.2+)
Step Two: Use one of the following:
sodium_crypto_generichash(), which is BLAKE2b, a hash function more secure than MD5 but faster than SHA256. (Link has benchmarks, etc.)
sodium_crypto_shorthash(), which is SipHash-2-4, which is appropriate for hash tables but should not be relied on for collision resistance.
_shorthash is about 3x as fast as _generichash, but you need a key and you have a small-but-realistic risk of collisions. With _generichash, you probably don't need to worry about collisions, and don't need to use a key (but may want to anyway).
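The two options above might be used roughly as follows. This is a sketch assuming the sodium extension is loaded; the helper names phrase_digest() and phrase_tag() and the choice of the minimum 16-byte BLAKE2b output are illustrative assumptions.

```php
<?php
// Hypothetical helpers; both require the sodium extension (bundled with PHP 7.2+).

// Unkeyed BLAKE2b digest, hex-encoded. 16 bytes is the shortest output libsodium allows.
function phrase_digest(string $phrase): string
{
    return bin2hex(sodium_crypto_generichash($phrase, '', SODIUM_CRYPTO_GENERICHASH_BYTES_MIN));
}

// Keyed SipHash-2-4 tag (8 bytes), hex-encoded. The same key must be reused for every phrase.
function phrase_tag(string $phrase, string $key): string
{
    return bin2hex(sodium_crypto_shorthash($phrase, $key));
}

if (extension_loaded('sodium')) {
    // Persist this key somewhere; generating a new one changes every tag.
    $key = random_bytes(SODIUM_CRYPTO_SHORTHASH_KEYBYTES);
    echo phrase_digest('ana are mere'), "\n";   // 32 hex characters
    echo phrase_tag('ana are mere', $key), "\n"; // 16 hex characters
}
```

The keyed variant only works for lookups if the key is stored alongside the application, since tags computed under different keys are not comparable.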
Adler32 performs best on my machine.
And md5() turned out faster than crc32().
fcn     time (s)   generated hash
crc32:  0.03163    798740135
md5:    0.0731     0dbab6d0c841278d33be207f14eeab8b
sha1:   0.07331    417a9e5c9ac7c52e32727cfd25da99eca9339a80
xor:    0.65218    119
xor2:   0.29301    134217728
add:    0.57841    1105
And the code used to generate this is:
<?php
$loops = 100000;
$str = "ana are mere";
echo "<pre>";

// Built-in crc32()
$tss = microtime(true);
for ($i = 0; $i < $loops; $i++) {
    $x = crc32($str);
}
$tse = microtime(true);
echo "\ncrc32: \t" . round($tse - $tss, 5) . " \t" . $x;

// Built-in md5()
$tss = microtime(true);
for ($i = 0; $i < $loops; $i++) {
    $x = md5($str);
}
$tse = microtime(true);
echo "\nmd5: \t" . round($tse - $tss, 5) . " \t" . $x;

// Built-in sha1()
$tss = microtime(true);
for ($i = 0; $i < $loops; $i++) {
    $x = sha1($str);
}
$tse = microtime(true);
echo "\nsha1: \t" . round($tse - $tss, 5) . " \t" . $x;

// Hand-written bitwise XOR of every byte. This needs `^`: PHP's `xor` is a
// logical operator that binds more loosely than `=`, so `$x = $x xor ...`
// would leave $x at 0x77 (119).
$tss = microtime(true);
for ($i = 0; $i < $loops; $i++) {
    $l = strlen($str);
    $x = 0x77;
    for ($j = 0; $j < $l; $j++) {
        $x = $x ^ ord($str[$j]);
    }
}
$tse = microtime(true);
echo "\nxor: \t" . round($tse - $tss, 5) . " \t" . $x;

// Hand-written shift-and-XOR. With `xor` instead of `^`, $x would only be
// shifted, ending at 0x08 << 24 = 134217728.
$tss = microtime(true);
for ($i = 0; $i < $loops; $i++) {
    $l = strlen($str);
    $x = 0x08;
    for ($j = 0; $j < $l; $j++) {
        $x = ($x << 2) ^ ord($str[$j]);
    }
}
$tse = microtime(true);
echo "\nxor2: \t" . round($tse - $tss, 5) . " \t" . $x;

// Hand-written additive checksum: sum of all byte values.
$tss = microtime(true);
for ($i = 0; $i < $loops; $i++) {
    $l = strlen($str);
    $x = 0;
    for ($j = 0; $j < $l; $j++) {
        $x = $x + ord($str[$j]);
    }
}
$tse = microtime(true);
echo "\nadd: \t" . round($tse - $tss, 5) . " \t" . $x;
CRC32 is pretty fast and there's a function for it: http://www.php.net/manual/en/function.crc32.php
You should be aware, though, that CRC32 will have more collisions than MD5 or even SHA-1 hashes, simply because of the reduced length (32 bits, compared to 128 bits and 160 bits respectively). If you just want to check whether a stored string is corrupted, you'll be fine with CRC32.
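To put rough numbers on the effect of hash length, the birthday-bound approximation can be sketched as below. It assumes hash outputs are uniformly distributed, which is only an approximation for CRC32; the function name collision_probability() and the example count of 100000 phrases are arbitrary choices for illustration.

```php
<?php
// Birthday-bound approximation: chance that at least two of $n distinct inputs
// share the same $bits-bit hash, assuming uniformly distributed hash outputs.
function collision_probability(int $n, int $bits): float
{
    $pairs = $n * ($n - 1) / 2;   // number of distinct input pairs
    $x = $pairs / pow(2, $bits);  // expected number of colliding pairs
    return -expm1(-$x);           // 1 - e^(-x), computed accurately for tiny x
}

foreach ([32 => 'crc32', 128 => 'md5', 160 => 'sha1'] as $bits => $name) {
    printf("%-6s %.3e\n", $name, collision_probability(100000, $bits));
}
```

Under these assumptions a 32-bit hash is more likely than not to produce at least one collision among 100000 phrases, while the 128-bit and 160-bit figures are vanishingly small.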