What integer hash function are good that accepts an integer hash key?
For random hash values, some engineers said golden ratio prime number(2654435761) is a bad choice, with my testing results, I found that it's not true; instead, 2654435761 distributes the hash values pretty good.
#define MCR_HashTableSize 2^10
unsigned int
Hash_UInt_GRPrimeNumber(unsigned int key)
{
key = key*2654435761 & (MCR_HashTableSize - 1)
return key;
}
The hash table size must be a power of two.
I have written a test program to evaluate many hash functions for integers, the results show that GRPrimeNumber is a pretty good choice.
I have tried:
With my testing results, I found that Golden Ratio Prime Number always has the fewer empty buckets or zero empty bucket and the shortest collision chain length.
Some hash functions for integers are claimed to be good, but the testing results show that when the total_data_entry / total_bucket_number = 3, the longest chain length is bigger than 10(max collision number > 10), and many buckets are not mapped(empty buckets), which is very bad, compared with the result of zero empty bucket and longest chain length 3 by Golden Ratio Prime Number Hashing.
BTW, with my testing results, I found one version of shifting-xor hash functions is pretty good(It's shared by mikera).
unsigned int Hash_UInt_M3(unsigned int key)
{
key ^= (key << 13);
key ^= (key >> 17);
key ^= (key << 5);
return key;
}
The answer depends on a lot of things like:
I suggest that you take a look at the Merkle-Damgard family of hash functions like SHA-1 etc
I don't think we can say that a hash function is "good" without knowing your data in advance ! and without knowing what you're going to do with it.
There are better data structures than hash tables for unknown data sizes (I'm assuming you're doing the hashing for a hash table here ). I would personally use a hash table when I Know I have a "finite" number of elements that are needing stored in a limited amount of memory. I would try and do a quick statistical analysis on my data, see how it is distributed etc before I start thinking about my hash function.
Knuth's multiplicative method:
hash(i)=i*2654435761 mod 2^32
In general, you should pick a multiplier that is in the order of your hash size (2^32
in the example) and has no common factors with it. This way the hash function covers all your hash space uniformly.
Edit: The biggest disadvantage of this hash function is that it preserves divisibility, so if your integers are all divisible by 2 or by 4 (which is not uncommon), their hashes will be too. This is a problem in hash tables - you can end up with only 1/2 or 1/4 of the buckets being used.
Fast and good hash functions can be composed from fast permutations with lesser qualities, like
To yield a hashing function with superior qualities, like demonstrated with PCG for random number generation.
This is in fact also the recipe rrxmrrxmsx_0 and murmur hash are using, knowingly or unknowingly.
I personally found
uint64_t xorshift(const uint64_t& n,int i){
return n^(n>>i);
}
uint64_t hash(const uint64_t& n){
uint64_t p = 0x5555555555555555ull; // pattern of alternating 0 and 1
uint64_t c = 17316035218449499591ull;// random uneven integer constant;
return c*xorshift(p*xorshift(n,32),32);
}
to be good enough.
A good hash function should
Let's first look at the identity function. It satisfies 1. but not 2. :
Input bit n determines output bit n with a correlation of 100% (red) and no others, they are therefore blue, giving a perfect red line across.
A xorshift(n,32) is not much better, yielding one and half a line. Still satisfying 1., because it is invertible with a second application.
A multiplication with an unsigned integer is much better, cascading more strongly and flipping more output bits with a probability of 0.5, which is what you want, in green. It satisfies 1. as for each uneven integer there is a multiplicative inverse.
Combining the two gives the following output, still satisfying 1. as the composition of two bijective functions yields another bijective function.
A second application of multiplication and xorshift will yield the following:
Or you can use Galois field multiplications like GHash, they have become reasonably fast on modern CPUs and have superior qualities in one step.
uint64_t const inline gfmul(const uint64_t& i,const uint64_t& j){
__m128i I{};I[0]^=i;
__m128i J{};J[0]^=j;
__m128i M{};M[0]^=0xb000000000000000ull;
__m128i X = _mm_clmulepi64_si128(I,J,0);
__m128i A = _mm_clmulepi64_si128(X,M,0);
__m128i B = _mm_clmulepi64_si128(A,M,0);
return A[0]^A[1]^B[1]^X[0]^X[1];
}
Depends on how your data is distributed. For a simple counter, the simplest function
f(i) = i
will be good (I suspect optimal, but I can't prove it).