Why does Java's hashCode() in String use 31 as a multiplier?

星月不相逢 2020-11-22 01:34

Per the Java documentation, the hash code for a String object is computed as:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
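For reference, the formula above can be evaluated with Horner's rule, one multiplication by 31 per character. This is a minimal sketch, not the JDK source; the class name `PolyHash` and method `hash` are made up for illustration.

```java
class PolyHash {
    // Evaluates s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] using Horner's rule.
    // Java int arithmetic wraps around on overflow, which matches 32-bit hash semantics.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "hello";
        // Compare the hand-rolled value with the built-in one.
        System.out.println(hash(s) == s.hashCode()); // prints true
    }
}
```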


        
13 Answers
  •  梦谈多话
    2020-11-22 02:05

    A big expectation from hash functions is that their result's uniform randomness survives an operation such as hash(x) % N where N is an arbitrary number (and in many cases, a power of two), one reason being that such operations are used commonly in hash tables for determining slots. Using prime number multipliers when computing the hash decreases the probability that your multiplier and the N share divisors, which would make the result of the operation less uniformly random.
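To make the shared-divisor point concrete, here is a small sketch that counts how many slots of a 16-slot table are hit by all three-letter lowercase strings ending in 'a', once with multiplier 32 (shares every divisor with 16) and once with 31. The class `MultiplierDemo` and its methods are hypothetical helpers, not part of any library.

```java
import java.util.HashSet;
import java.util.Set;

class MultiplierDemo {
    // Polynomial string hash with a configurable multiplier (illustration only).
    static int hash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    // Counts the distinct slots (hash mod n) used by the 26*26 strings "??a".
    static int slotsUsed(int multiplier, int n) {
        Set<Integer> slots = new HashSet<>();
        for (char a = 'a'; a <= 'z'; a++) {
            for (char b = 'a'; b <= 'z'; b++) {
                String s = "" + a + b + 'a';
                slots.add(Math.floorMod(hash(s, multiplier), n));
            }
        }
        return slots.size();
    }

    public static void main(String[] args) {
        System.out.println(slotsUsed(32, 16)); // prints 1: only the last character survives mod 16
        System.out.println(slotsUsed(31, 16)); // prints 16: every slot is reachable
    }
}
```

With multiplier 32, every term except the last character is a multiple of 16, so all 676 strings land in the same slot.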

    Others have pointed out the nice property that multiplication by 31 can be done with a shift and a subtraction. I just want to point out that there is a mathematical term for such primes: Mersenne primes.

    All Mersenne primes are one less than a power of two, so we can write them as:

    p = 2^n - 1
    

    Multiplying x by p:

    x * p = x * (2^n - 1) = x * 2^n - x = (x << n) - x
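A quick check of this identity for p = 31 (n = 5) in Java, as a sketch with a made-up class name `ShiftSub`. Because Java `int` arithmetic wraps modulo 2^32, the two forms agree even when the product overflows.

```java
class ShiftSub {
    // 31 * x rewritten as (x << 5) - x, since 31 = 2^5 - 1.
    static int times31(int x) {
        return (x << 5) - x;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 97, 123456789, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            // Both sides wrap identically on overflow.
            System.out.println(x + ": " + (times31(x) == 31 * x));
        }
    }
}
```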
    

    Shifts (SAL/SHL) and subtractions (SUB) are faster than multiplications (MUL) on many machines; see the instruction tables from Agner Fog.

    That's why GCC seems to optimize multiplications by Mersenne primes by replacing them with shifts and subtractions, see here.

    However, in my opinion, such a small prime is a bad choice for a hash function. With a relatively good hash function, you would expect to have randomness at the higher bits of the hash. However, with the Java hash function, there is almost no randomness at the higher bits with shorter strings (and still highly questionable randomness at the lower bits). This makes it more difficult to build efficient hash tables. See this nice trick you couldn't do with the Java hash function.
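To illustrate the point about the higher bits, the sketch below reports how many of the 32 bits are actually occupied by the hash of a few short strings. The class name `HighBits` and method `bitsUsed` are made up for this example.

```java
class HighBits {
    // Number of significant bits in the hash: 32 minus the count of leading zero bits.
    static int bitsUsed(String s) {
        return 32 - Integer.numberOfLeadingZeros(s.hashCode());
    }

    public static void main(String[] args) {
        System.out.println(bitsUsed("zz"));  // prints 12 ("zz".hashCode() is 3904)
        System.out.println(bitsUsed("zzz")); // prints 17 ("zzz".hashCode() is 121146)
    }
}
```

So for two- and three-letter lowercase keys, the top 15 or more bits of the hash are always zero.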

    Some answers mention that they believe it is good that 31 fits into a byte. This is actually useless since:

    (1) We execute shifts instead of multiplications, so the size of the multiplier does not matter.

    (2) As far as I know, there is no specific x86 instruction to multiply an 8-byte value by a 1-byte value, so you would have needed to convert "31" to an 8-byte value anyway, even if you were multiplying. See here, you multiply entire 64-bit registers.

    (And 127 is actually the largest Mersenne prime that could fit in a byte.)

    Does a smaller value increase randomness in the middle-lower bits? Maybe, but it also seems to greatly increase the possible collisions :).

    One could list many different issues but they generally boil down to two core principles not being fulfilled well: Confusion and Diffusion

    But is it fast? Probably, since it doesn't do much. However, if performance is really the focus here, one character per loop is quite inefficient. Why not do 4 characters at a time (8 bytes) per loop iteration for longer strings, like this? Well, that would be difficult to do with the current definition of hash where you need to multiply every character individually (please tell me if there is a bit hack to solve this :D).
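One algebraic option (not a bit hack) is to unroll the recurrence: four Horner steps collapse into a single update using the precomputed powers 31^2 = 961, 31^3 = 29791 and 31^4 = 923521. The sketch below only shows that this yields the same value as `String.hashCode()`; whether it is actually faster is not measured here, and the class name `UnrolledHash` is hypothetical.

```java
class UnrolledHash {
    // Same polynomial hash, consuming four characters per loop iteration.
    // h' = 31^4*h + 31^3*c0 + 31^2*c1 + 31*c2 + c3, in wrapping int arithmetic.
    static int hash(String s) {
        int n = s.length();
        int h = 0;
        int i = 0;
        for (; i + 3 < n; i += 4) {
            h = 923521 * h
                + 29791 * s.charAt(i)
                + 961 * s.charAt(i + 1)
                + 31 * s.charAt(i + 2)
                + s.charAt(i + 3);
        }
        // Handle the remaining 0 to 3 characters one at a time.
        for (; i < n; i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "The quick brown fox jumps over the lazy dog";
        System.out.println(hash(s) == s.hashCode()); // prints true
    }
}
```

The four multiplications inside one iteration no longer depend on each other, only on the previous `h`, which is what gives the hardware room to overlap them.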
