Fastest hash for non-cryptographic uses?

挽巷 2020-12-04 08:23

I'm essentially preparing phrases to be put into the database; they may be malformed, so I want to store a short hash of them instead (I will be simply comparing if they exi

13 answers
  • 2020-12-04 08:40

    2019 update: This answer is the most up to date. Libraries that support murmur are widely available for all languages.

    The current recommendation is to use the Murmur Hash Family (see specifically the murmur2 or murmur3 variants).

    Murmur hashes were designed for fast hashing with minimal collisions (much faster than CRC, MDx and SHAx). They are well suited to finding duplicates and very appropriate for hash table indexes.

    In fact it's used by many modern databases (Redis, Elasticsearch, Cassandra) to compute all sorts of hashes for various purposes. This specific algorithm was the root source of many performance improvements in the current decade.

    It's also used in implementations of Bloom Filters. You should be aware that if you're searching for "fast hashes", you're probably facing a typical problem that is solved by Bloom filters. ;-)
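
    To make the Bloom filter idea concrete, here is a minimal sketch in PHP. The class name `TinyBloomFilter`, the bit-array size and the number of probes are illustrative assumptions, not part of any library. The two probe seeds are taken from `md5()` only because it ships with every PHP build; a murmur implementation could be substituted. The sketch also assumes a 64-bit PHP build.

```php
<?php
// Hypothetical, minimal Bloom filter: reports "definitely not seen" or "probably seen".
final class TinyBloomFilter
{
    private $bits;       // bit array packed into a binary string
    private $numBits;
    private $numProbes;

    public function __construct(int $numBits = 8192, int $numProbes = 4)
    {
        $this->numBits = $numBits;
        $this->numProbes = $numProbes;
        $this->bits = str_repeat("\0", intdiv($numBits + 7, 8));
    }

    // Derive $numProbes bit positions from two 32-bit seeds (double hashing).
    private function positions(string $item): array
    {
        // First 8 bytes of the raw digest, read as two unsigned 32-bit integers.
        $parts = unpack('N2', md5($item, true));
        $h1 = $parts[1];
        $h2 = $parts[2] | 1;   // force the step to be odd so the probes do not all coincide
        $positions = [];
        for ($i = 0; $i < $this->numProbes; $i++) {
            $positions[] = ($h1 + $i * $h2) % $this->numBits;
        }
        return $positions;
    }

    public function add(string $item): void
    {
        foreach ($this->positions($item) as $p) {
            $byte = intdiv($p, 8);
            $this->bits[$byte] = chr(ord($this->bits[$byte]) | (1 << ($p % 8)));
        }
    }

    public function mightContain(string $item): bool
    {
        foreach ($this->positions($item) as $p) {
            if ((ord($this->bits[intdiv($p, 8)]) & (1 << ($p % 8))) === 0) {
                return false;   // one unset bit proves the item was never added
            }
        }
        return true;            // all bits set: probably added (false positives are possible)
    }
}

$filter = new TinyBloomFilter();
$filter->add('ana are mere');
var_dump($filter->mightContain('ana are mere'));   // an added item is always reported as present
```

    A filter like this can say "never seen" with certainty, so the exact lookup is needed only when it answers "probably seen".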

    Note: murmur is a general-purpose hash, meaning NON-cryptographic. It does not prevent someone from finding the source "text" that generated a hash. It's NOT appropriate for hashing passwords.

    Some more details: MurmurHash - what is it?
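
    As a rough illustration of duplicate detection with a murmur hash in PHP: the sketch below assumes PHP 8.1 or later, where `hash()` accepts `'murmur3a'` (32-bit MurmurHash3). On older builds it falls back to `'crc32b'` so that it still runs. The helper name `phrase_fingerprint` and the `$seen` array, which stands in for the database column, are hypothetical.

```php
<?php
// Returns a short hex fingerprint of a phrase.
// Assumption: PHP 8.1+ exposes MurmurHash3 through hash() as 'murmur3a';
// older builds fall back to 'crc32b' here.
function phrase_fingerprint(string $phrase): string
{
    static $algo = null;
    if ($algo === null) {
        $algo = in_array('murmur3a', hash_algos(), true) ? 'murmur3a' : 'crc32b';
    }
    return hash($algo, $phrase);
}

// Hypothetical in-memory stand-in for the stored hashes.
$seen = [];
$phrases = ['ana are mere', 'hello world', 'ana are mere'];
$duplicates = [];

foreach ($phrases as $phrase) {
    $key = phrase_fingerprint($phrase);
    if (isset($seen[$key])) {
        $duplicates[] = $phrase;   // same fingerprint already stored
    } else {
        $seen[$key] = true;
    }
}

print_r($duplicates);
```

    Both algorithms used here produce 32 bits, so two different phrases can share a fingerprint; a longer variant such as `'murmur3f'` (128-bit, also PHP 8.1+) makes that far less likely.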

  • 2020-12-04 08:41

    Instead of assuming that MD5 is "fairly slow", try it. A simple C-based implementation of MD5 on a simple PC (mine, a 2.4 GHz Core2, using a single core) can hash 6 million small messages per second. A small message here is anything up to 55 bytes. For longer messages, MD5 hashing time is linear in the message size, i.e. it crunches data at about 400 megabytes per second. You may note that this is four times the maximum speed of a good hard disk or a gigabit Ethernet network card.

    Since my PC has four cores, this means that hashing data as fast as my hard disk can provide or receive it uses at most 6% of the available computing power. It takes a very special situation for hashing speed to become a bottleneck or even to induce a noticeable cost on a PC.
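
    One way to "try it" from PHP is a small timing loop like the sketch below. The function name `md5_messages_per_second`, the iteration count and the 40-byte message are arbitrary illustrative choices, and the figure will include PHP's own call overhead, so it is not comparable to the C numbers quoted above.

```php
<?php
// Measures how many short messages md5() hashes per second on this machine.
function md5_messages_per_second(int $iterations = 200000): float
{
    $message = str_repeat('x', 40);          // short message, well under 56 bytes
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        md5($message . $i);                  // vary the input slightly on each pass
    }
    $elapsed = microtime(true) - $start;
    return $iterations / max($elapsed, 1e-9);   // guard against a zero interval
}

printf("%.0f md5() calls per second\n", md5_messages_per_second());
```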

    On much smaller architectures where hashing speed may become somewhat relevant, you may want to use MD4. MD4 is fine for non-cryptographic purposes (and for cryptographic purposes, you should not be using MD5 anyway). It has been reported that MD4 is even faster than CRC32 on ARM-based platforms.

  • 2020-12-04 08:42

    Step One: Install libsodium (or make sure you're using PHP 7.2+)

    Step Two: Use one of the following:

    1. sodium_crypto_generichash(), which is BLAKE2b, a hash function more secure than MD5 but faster than SHA256. (Link has benchmarks, etc.)
    2. sodium_crypto_shorthash(), which is SipHash-2-4, which is appropriate for hash tables but should not be relied on for collision resistance.

    _shorthash is about 3x as fast as _generichash, but you need a key and you have a small-but-realistic risk of collisions. With _generichash, you probably don't need to worry about collisions, and don't need to use a key (but may want to anyway).
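
    A minimal sketch of both calls, assuming the sodium extension is loaded. The helper names `phrase_generichash` and `phrase_shorthash` and the 16-byte BLAKE2b output length are illustrative choices; the `sodium_*` functions themselves are the ones named above.

```php
<?php
// Both helpers assume the sodium extension is loaded (bundled with PHP 7.2+).

// BLAKE2b digest without a key; 16 bytes is the shortest length libsodium accepts.
function phrase_generichash(string $phrase): string
{
    return bin2hex(sodium_crypto_generichash($phrase, '', 16));
}

// SipHash-2-4 digest (8 bytes). $key must be SODIUM_CRYPTO_SHORTHASH_KEYBYTES long
// and must stay the same for stored hashes to remain comparable.
function phrase_shorthash(string $phrase, string $key): string
{
    return bin2hex(sodium_crypto_shorthash($phrase, $key));
}

if (extension_loaded('sodium')) {
    // In practice the key would be generated once and kept with the application config.
    $key = sodium_crypto_shorthash_keygen();
    echo phrase_generichash('ana are mere'), "\n";       // 32 hex characters
    echo phrase_shorthash('ana are mere', $key), "\n";   // 16 hex characters
}
```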

  • 2020-12-04 08:42

    Adler32 performs best on my machine. And md5() turned out faster than crc32().

  • 2020-12-04 08:45
    fcn     time (s)  generated hash
    crc32:  0.03163   798740135
    md5:    0.0731    0dbab6d0c841278d33be207f14eeab8b
    sha1:   0.07331   417a9e5c9ac7c52e32727cfd25da99eca9339a80
    xor:    0.65218   119
    xor2:   0.29301   134217728
    add:    0.57841   1105
    

    And the code used to generate this is:

     $loops = 100000;
     $str = "ana are mere";
    
     echo "<pre>";
    
     $tss = microtime(true);
     for($i=0; $i<$loops; $i++){
      $x = crc32($str);
     }
     $tse = microtime(true);
     echo "\ncrc32: \t" . round($tse-$tss, 5) . " \t" . $x;
    
     $tss = microtime(true);
     for($i=0; $i<$loops; $i++){
      $x = md5($str);
     }
     $tse = microtime(true);
     echo "\nmd5: \t".round($tse-$tss, 5) . " \t" . $x;
    
     $tss = microtime(true);
     for($i=0; $i<$loops; $i++){
      $x = sha1($str);
     }
     $tse = microtime(true);
     echo "\nsha1: \t".round($tse-$tss, 5) . " \t" . $x;
    
     $tss = microtime(true);
     for($i=0; $i<$loops; $i++){
      $l = strlen($str);
      $x = 0x77;
      for($j=0;$j<$l;$j++){
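       // `^` is bitwise XOR. The `xor` keyword binds looser than `=`, so the original
       // line never updated $x (hence the constant 119 in the results).
       $x = $x ^ ord($str[$j]);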
      }
     }
     $tse = microtime(true);
     echo "\nxor: \t".round($tse-$tss, 5) . " \t" . $x;
    
     $tss = microtime(true);
     for($i=0; $i<$loops; $i++){
      $l = strlen($str);
      $x = 0x08;
      for($j=0;$j<$l;$j++){
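       // Bitwise `^` on the byte value. With the `xor` keyword only the shift was
       // assigned, which gave 8 << 24 = 134217728 in the results.
       $x = ($x<<2) ^ ord($str[$j]);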
      }
     }
     $tse = microtime(true);
     echo "\nxor2: \t".round($tse-$tss, 5) . " \t" . $x;
    
     $tss = microtime(true);
     for($i=0; $i<$loops; $i++){
      $l = strlen($str);
      $x = 0;
      for($j=0;$j<$l;$j++){
       $x = $x + ord($str[$j]);
      }
     }
     $tse = microtime(true);
     echo "\nadd: \t".round($tse-$tss, 5) . " \t" . $x;
    
  • 2020-12-04 08:50

    CRC32 is pretty fast and there's a function for it: http://www.php.net/manual/en/function.crc32.php

    But you should be aware that CRC32 will have more collisions than MD5 or even SHA-1 hashes, simply because of the reduced length (32 bits, compared to 128 and 160 bits respectively). But if you just want to check whether a stored string is corrupted, you'll be fine with CRC32.
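
    To put rough numbers on the effect of hash length, the sketch below applies the standard birthday approximation. It assumes hash outputs are uniformly distributed, which is an idealisation for CRC32, and the function name `collision_probability` is hypothetical.

```php
<?php
// Approximate probability that at least two of $items distinct inputs share
// a hash of $bits bits: p = 1 - exp(-n(n-1) / (2 * 2^bits)).
function collision_probability(int $items, int $bits): float
{
    $pairs = $items * ($items - 1) / 2;
    // -expm1(-x) equals 1 - exp(-x) but keeps precision when x is tiny.
    return -expm1(-$pairs / pow(2.0, $bits));
}

foreach ([32 => 'CRC32', 128 => 'MD5', 160 => 'SHA-1'] as $bits => $name) {
    printf("%-6s %.3e\n", $name, collision_probability(100000, $bits));
}
```

    Under this model a 32-bit hash reaches even odds of at least one collision at roughly 77,000 items, while the 128-bit and 160-bit figures stay negligibly small at that scale.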
