What integer hash functions are good that accept an integer hash key?

孤街浪徒 2020-11-22 17:23

What integer hash functions are good that accept an integer hash key?

11 Answers
  • 2020-11-22 17:39

    For random hash values, some engineers have said that the golden ratio prime number (2654435761) is a bad choice. My test results suggest otherwise: 2654435761 distributes the hash values quite well.

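    #define MCR_HashTableSize (1u << 10)  /* 2 to the power 10; in C, ^ is XOR, not a power */
    
    unsigned int
    Hash_UInt_GRPrimeNumber(unsigned int key)
    {
      /* Multiply by the golden ratio prime, then mask down to a bucket index. */
      key = (key * 2654435761u) & (MCR_HashTableSize - 1);
      return key;
    }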
    

    The hash table size must be a power of two.

    I wrote a test program to evaluate many integer hash functions; the results show that GRPrimeNumber is a pretty good choice.

    My test procedure:

    1. Set total_data_entry_number / total_bucket_number to 2, 3, and 4, where total_bucket_number is the hash table size.
    2. Map each hash value to a bucket index with a bitwise AND against (hash_table_size - 1), as shown in Hash_UInt_GRPrimeNumber().
    3. Count the collisions in each bucket.
    4. Record every bucket that no key maps to, that is, every empty bucket.
    5. Find the maximum collision count over all buckets, that is, the longest chain length.
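
    The steps above can be sketched as a small measurement routine. This is not the original test program, which is not shown; the names `bucket_stats`, `measure`, `hash_gr_prime`, and `TABLE_SIZE` are hypothetical, and the key set is whatever the caller supplies.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the original test program. */
#define TABLE_BITS 10u
#define TABLE_SIZE (1u << TABLE_BITS)   /* must be a power of two */

typedef struct {
    unsigned empty_buckets;   /* buckets that received no key */
    unsigned longest_chain;   /* largest number of keys in one bucket */
} bucket_stats;

/* The multiplicative hash from the answer, masked down to a bucket index. */
static uint32_t hash_gr_prime(uint32_t key)
{
    return (key * 2654435761u) & (TABLE_SIZE - 1u);
}

/* Hash every key, count entries per bucket, then summarise the counts. */
static bucket_stats measure(const uint32_t *keys, size_t n,
                            uint32_t (*hash)(uint32_t))
{
    static unsigned counts[TABLE_SIZE];
    bucket_stats s = {0u, 0u};

    for (size_t b = 0; b < TABLE_SIZE; b++)
        counts[b] = 0u;
    for (size_t i = 0; i < n; i++)
        counts[hash(keys[i])]++;
    for (size_t b = 0; b < TABLE_SIZE; b++) {
        if (counts[b] == 0u)
            s.empty_buckets++;
        if (counts[b] > s.longest_chain)
            s.longest_chain = counts[b];
    }
    return s;
}
```

    For the sequential keys 0 to 3071 (a load factor of 3), this reports zero empty buckets and a longest chain of 3, because multiplying by an odd constant permutes the low 10 bits. Other key sets can behave differently, so it is worth running the measurement on keys that resemble the real data.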

    In my tests, the Golden Ratio Prime Number hash always had the fewest empty buckets (often none) and the shortest collision chains.

    Some integer hash functions are claimed to be good, but in my tests, at total_data_entry / total_bucket_number = 3, their longest chain length exceeds 10 (max collision number > 10) and many buckets stay empty. That is very bad compared with the zero empty buckets and longest chain length of 3 that Golden Ratio Prime Number hashing produces.

    By the way, my tests also showed that one version of the shift-xor hash functions is pretty good (it was shared by mikera).

    unsigned int Hash_UInt_M3(unsigned int key)
    {
      key ^= (key << 13);
      key ^= (key >> 17);    
      key ^= (key << 5); 
      return key;
    }
    
  • 2020-11-22 17:45

    The answer depends on a lot of things like:

    • Where do you intend to employ it?
    • What are you trying to do with the hash?
    • Do you need a cryptographically secure hash function?

    I suggest that you take a look at the Merkle–Damgård family of hash functions, such as SHA-1.

  • 2020-11-22 17:46

    I don't think we can say that a hash function is "good" without knowing your data in advance, and without knowing what you're going to do with it.

    There are better data structures than hash tables for unknown data sizes (I'm assuming you're doing the hashing for a hash table here). I would personally use a hash table when I know I have a "finite" number of elements that need to be stored in a limited amount of memory. I would do a quick statistical analysis of my data to see how it is distributed before I start thinking about my hash function.

  • 2020-11-22 17:50

    Knuth's multiplicative method:

    hash(i)=i*2654435761 mod 2^32
    

    In general, you should pick a multiplier that is on the order of your hash size (2^32 in the example) and has no common factors with it. This way the hash function covers all your hash space uniformly.

    Edit: The biggest disadvantage of this hash function is that it preserves divisibility by powers of two: if your integers are all divisible by 2 or by 4 (which is not uncommon), their hashes will be too. This is a problem in hash tables whose bucket index is taken from the low bits, because you can end up with only 1/2 or 1/4 of the buckets being used.
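
    A minimal C sketch of the method and of the divisibility issue follows. The function names are hypothetical. Taking the bucket index from the high bits, as `bucket_high` does, is a common remedy and is an assumption added here rather than part of the answer above.

```c
#include <stdint.h>

/* Knuth's multiplicative hash; uint32_t arithmetic wraps modulo 2^32. */
static uint32_t knuth_hash(uint32_t i)
{
    return i * 2654435761u;
}

/* Bucket index from the LOW bits: inherits the divisibility problem. */
static uint32_t bucket_low(uint32_t i, unsigned bits)
{
    return knuth_hash(i) & ((1u << bits) - 1u);
}

/* Bucket index from the HIGH bits (bits must be in the range 1..31). */
static uint32_t bucket_high(uint32_t i, unsigned bits)
{
    return knuth_hash(i) >> (32u - bits);
}
```

    With keys that are all multiples of 4, `bucket_low(key, 10)` is always a multiple of 4, so at most 256 of the 1024 buckets can be hit. `bucket_high` draws on bits that every input bit can influence, which is why it is often preferred for power-of-two tables.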

  • 2020-11-22 17:51

    Fast and good hash functions can be composed from fast permutations with lesser qualities, such as

    • multiplication by an odd integer
    • binary rotations
    • xorshift

    Composing them yields a hash function with superior qualities, as PCG demonstrates for random number generation.

    This is in fact also the recipe that rrxmrrxmsx_0 and murmur hash use, knowingly or unknowingly.

    I personally found

    #include <cstdint>

    // XOR the value with itself shifted right by i bits (an invertible step).
    uint64_t xorshift(const uint64_t& n, int i) {
      return n ^ (n >> i);
    }
    uint64_t hash(const uint64_t& n) {
      uint64_t p = 0x5555555555555555ull;   // pattern of alternating 0 and 1
      uint64_t c = 17316035218449499591ull; // random odd integer constant
      return c * xorshift(p * xorshift(n, 32), 32);
    }
    

    to be good enough.

    A good hash function should

    1. be bijective, if possible, so that no information is lost and collisions are minimized
    2. cascade as much and as evenly as possible, i.e. each input bit should flip every output bit with probability 0.5.
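
    Criterion 2 can be estimated empirically. The sketch below is a C port of the hash above plus a hypothetical helper, `flip_probability`, that flips one input bit over many sample inputs and counts how often one output bit changes. The sample inputs come from a simple linear congruential generator, which is an arbitrary choice made for this illustration.

```c
#include <stdint.h>

/* C port of the answer's hash, for illustration only. */
static uint64_t xorshift64(uint64_t n, int i) { return n ^ (n >> i); }

static uint64_t hash64(uint64_t n)
{
    const uint64_t p = 0x5555555555555555ull;
    const uint64_t c = 17316035218449499591ull;
    return c * xorshift64(p * xorshift64(n, 32), 32);
}

/* The identity function, as a baseline with no cascading at all. */
static uint64_t identity64(uint64_t n) { return n; }

/* Fraction of trials in which flipping input bit `in` flips output bit `out`. */
static double flip_probability(uint64_t (*h)(uint64_t),
                               int in, int out, int trials)
{
    uint64_t x = 0x0123456789abcdefull;   /* arbitrary seed */
    int flips = 0;

    for (int t = 0; t < trials; t++) {
        /* next sample input from a 64-bit linear congruential generator */
        x = x * 6364136223846793005ull + 1442695040888963407ull;
        uint64_t diff = h(x) ^ h(x ^ (1ull << in));
        flips += (int)((diff >> out) & 1u);
    }
    return (double)flips / trials;
}
```

    For `identity64` the result is exactly 1.0 when `in == out` and 0.0 otherwise. For a well-mixing hash, the value for every pair of bit positions should be close to 0.5.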

    Let's first look at the identity function. It satisfies 1. but not 2.:

    Input bit n determines output bit n with a correlation of 100% (red) and affects no other output bit (blue), giving a perfect red line across.

    A xorshift(n,32) is not much better, yielding one and a half lines. It still satisfies 1., because applying it a second time inverts it.

    A multiplication by an odd unsigned integer is much better: it cascades more strongly and flips more output bits with a probability of 0.5 (shown in green), which is what you want. It satisfies 1. because every odd integer has a multiplicative inverse modulo 2^64.

    Combining the two gives the following output, still satisfying 1. as the composition of two bijective functions yields another bijective function.
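
    To make the bijectivity argument concrete, here is a hedged C sketch that inverts one multiply-and-xorshift round. The names `mul_inverse`, `fold_high`, `mix`, and `unmix` are hypothetical, and computing the inverse by Newton iteration is one common approach, not something the answer prescribes.

```c
#include <stdint.h>

/* Inverse of an odd 64-bit multiplier modulo 2^64, by Newton iteration. */
static uint64_t mul_inverse(uint64_t c)
{
    uint64_t x = c;               /* c*c == 1 (mod 8) for odd c: 3 correct bits */
    for (int k = 0; k < 5; k++)   /* each step doubles them: 6, 12, 24, 48, 96 */
        x *= 2u - c * x;
    return x;
}

/* n ^ (n >> 32) is its own inverse, because the high half is unchanged. */
static uint64_t fold_high(uint64_t n) { return n ^ (n >> 32); }

/* One round: xorshift, then multiply by the odd constant c. */
static uint64_t mix(uint64_t n, uint64_t c)
{
    return c * fold_high(n);
}

/* Undo the round: multiply by the inverse of c, then xorshift again. */
static uint64_t unmix(uint64_t h, uint64_t c)
{
    return fold_high(h * mul_inverse(c));
}
```

    Because `unmix(mix(n, c), c)` returns `n` for every 64-bit `n` and every odd `c`, the round loses no information, and chaining several such rounds stays bijective.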

    A second application of multiplication and xorshift will yield the following:

    Or you can use Galois field multiplications like GHash. They have become reasonably fast on modern CPUs and have superior qualities in one step.

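    #include <cstdint>
    #include <immintrin.h> // _mm_clmulepi64_si128; build with -mpclmul

    // Carry-less multiply of i and j, then fold the 128-bit product back
    // to 64 bits using the constant M.
    // Indexing __m128i with [] is a GCC/Clang vector extension.
    inline uint64_t gfmul(const uint64_t& i, const uint64_t& j) {
      __m128i I{}; I[0] ^= i;
      __m128i J{}; J[0] ^= j;
      __m128i M{}; M[0] ^= 0xb000000000000000ull;
      __m128i X = _mm_clmulepi64_si128(I, J, 0);
      __m128i A = _mm_clmulepi64_si128(X, M, 0);
      __m128i B = _mm_clmulepi64_si128(A, M, 0);
      return A[0] ^ A[1] ^ B[1] ^ X[0] ^ X[1];
    }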
    
  • 2020-11-22 17:52

    It depends on how your data is distributed. For a simple counter, the simplest function

    f(i) = i
    

    will be good (I suspect optimal, but I can't prove it).
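
    As a small illustration, assuming a power-of-two table and a hypothetical helper named `identity_bucket`, consecutive counter values land in consecutive buckets:

```c
#include <stdint.h>

/* Identity hash reduced to a bucket index; table_size must be a power of two. */
static uint32_t identity_bucket(uint32_t i, uint32_t table_size)
{
    return i & (table_size - 1u);
}
```

    Keys 0 to 1023 fill a 1024-bucket table with exactly one key per bucket. The same function does poorly when the keys share a stride, for example when every key is a multiple of the table size and therefore maps to bucket 0.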
