Why only use primes for Hash function division method

Question

Hashing with the division method means h(k) = k mod m. I read that:

m should not be a power of 2. This is because if m = 2^p, h becomes just the p lowest-order bits of k. Usually we choose m to be a prime number not too close to a power of 2.

Could someone explain the "lowest-order bits" part with a small example? I thought all that (mod m) does is wrap the result around a range of size m. Somehow I can't see the issue if m is a power of 2.

Answer 1:

All data in the computer is stored as binary data. A binary number is written in base-2.

If you hash data, you want to create a fingerprint that is easy to compare. Similar data that is not exactly the same as the original data shouldn't produce the same fingerprint (hash).

Guess what happens if you use an m where m = 2^p (p is an integer >= 0). Every bit position at or above 2^p represents a multiple of 2^p (2^7 is a multiple of 2^4, for example), so all of those bits contribute nothing to the remainder and are effectively reduced to 0. You cut off part of the data. This means that keys which differ only in the left-most bits of the binary number will produce the same hash.

Example:

k:       1111111111010101
m:       0000000001000000 (2^6)
k mod m: 0000000000010101

Now do the same for this:

k:       0000000000010101
m:       0000000001000000 (2^6)
k mod m: 0000000000010101

Hey, that is the same hash! This is exactly the reason why m is chosen to be a number that is not close to 2^p (usually a prime). This way the left-most bits do matter in calculating the hash, and it is far less likely that two similar pieces of data create identical hashes.
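To check the example above, here is a small Python sketch. The modulus 53 is only an illustrative prime picked for this demo, not a value from the question. It shows that k mod 2^6 is the same as keeping the low 6 bits of k, so both keys collide, while the prime modulus separates them:

```python
# Keys from the example: they differ only in the upper 10 bits.
k1 = 0b1111111111010101
k2 = 0b0000000000010101

m_pow2 = 1 << 6   # 64, a power of two
m_prime = 53      # illustrative prime (an assumption for this demo)

# For m = 2^p, k % m is the same as masking off all but the low p bits.
assert k1 % m_pow2 == k1 & (m_pow2 - 1)

hash_pow2 = (k1 % m_pow2, k2 % m_pow2)     # (21, 21): collision
hash_prime = (k1 % m_prime, k2 % m_prime)  # (38, 21): no collision

print(hash_pow2, hash_prime)
```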

Answer 2:

The remainder of a division can* be computed by repeatedly subtracting the divisor from the dividend until what is left is less than the divisor. For a binary number and a power-of-two divisor, this subtraction only affects the bits at or to the left of the divisor's single 1 bit, eventually making them 0, and keeps the bits to the right unchanged.

  1110001111100001₂   58337
- 0000000100000000₂      2⁸
= 1110001011100001₂   58081
___________________________

  1110001011100001₂   58081
- 0000000100000000₂      2⁸
- 0000000100000000₂      2⁸
  ...
= 0000000011100001₂     225

When the divisor is odd, all of the lower bits can be affected when repeatedly subtracting it:

  1110001111100001₂   58337
- 0000000011000011₂     195
= 1110001100011110₂   58142
___________________________

  1110001111100001₂   58337
- 0000000011000011₂     195
- 0000000011000011₂     195
  ...
= 0000000000100000₂      32
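The two worked subtractions above can be replayed with a short Python sketch. The helper name `remainder_by_subtraction` is made up for this illustration. It records every value the low 8 bits take while the divisor is being subtracted:

```python
def remainder_by_subtraction(k, m):
    """Return (k mod m, set of low-8-bit patterns seen along the way)."""
    low_bytes = {k & 0xFF}
    while k >= m:
        k -= m                   # one subtraction step, as in the text
        low_bytes.add(k & 0xFF)  # remember what the low 8 bits look like now
    return k, low_bytes

# Power-of-two divisor: the low 8 bits never change.
r_pow2, seen_pow2 = remainder_by_subtraction(58337, 256)

# Odd divisor: the low 8 bits take many different values.
r_odd, seen_odd = remainder_by_subtraction(58337, 195)

print(r_pow2, len(seen_pow2))
print(r_odd, len(seen_odd))
```

With the divisor 256 the set contains a single pattern, because the low 8 bits of the key pass straight through to the remainder. With 195 the set contains many patterns, since each subtraction disturbs the low bits.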

Note that it is sufficient for the divisor to not be divisible by two: each factor of two shifts the divisor one digit to the left in binary, and subtracting it can only change digits at or to the left of its lowest 1 bit, never the digits to the right of it.
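The effect of factors of two in the divisor can be checked directly. In this Python sketch the divisors 12 and 13 are arbitrary choices for illustration. Since 12 = 2^2 * 3, the two lowest bits of every key survive unchanged in k mod 12, while the odd divisor 13 does not preserve them:

```python
m_even = 12  # 2^2 * 3: two factors of two
m_odd = 13   # no factor of two

keys = range(10_000)

# With two factors of two, the low 2 bits of k always equal the low 2 bits of k % m.
low_bits_kept_even = all((k % m_even) & 0b11 == k & 0b11 for k in keys)

# With an odd divisor this no longer holds (for example k = 13 gives remainder 0).
low_bits_kept_odd = all((k % m_odd) & 0b11 == k & 0b11 for k in keys)

print(low_bits_kept_even, low_bits_kept_odd)
```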

The further away the divisor is from a power of two, that is, the more evenly its digits are split between 0 and 1, the more digits of the remainder are affected by each subtraction. This means, for example, that the modulus 78985 (10011010010001001₂) is much better than 65537 (10000000000000001₂), even though 65537 is prime and 78985 is not.

This all applies where the hash is "poor", that is, not equally distributed across all output bits. If we have a good hash, we can use any hash table size, and therefore any divisor, that we want to, and any method of range reduction, like fastrange.
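For reference, the multiply-and-shift reduction usually called fastrange can be sketched as follows. The function name `fastrange32` and the 32-bit hash width are assumptions for this sketch, not details given in the answer:

```python
def fastrange32(h, n):
    """Map a 32-bit hash h into the range [0, n) with a multiply and a shift."""
    if not 0 <= h < 2**32:
        raise ValueError("h must fit in 32 bits")
    # floor(h * n / 2^32): the result is driven by the high bits of h.
    return (h * n) >> 32

lowest = fastrange32(0, 10)
middle = fastrange32(2**31, 10)
highest = fastrange32(2**32 - 1, 10)

print(lowest, middle, highest)
```

Because the result depends mostly on the high bits of h, this kind of reduction only makes sense when the hash is well distributed in those bits, which is the "good hash" case described above.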

*It is typically not actually computed like this, but the result is equivalent.

Source: https://stackoverflow.com/questions/12102625/why-only-use-primes-for-hash-function-division-method

Tags

hash

modulo