Why does Java's hashCode() in String use 31 as a multiplier?

星月不相逢 2020-11-22 01:34

Per the Java documentation, the hash code for a String object is computed as:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
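
The formula above can be evaluated with Horner's rule, one multiply-add per character. The sketch below is a minimal re-implementation for illustration only (the class and method names are made up here, not taken from the JDK). It relies on Java `int` arithmetic wrapping around on overflow, and it is checked against the built-in `String.hashCode()`.

```java
class PolyHash {
    // Evaluates s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] with Horner's rule.
    // int overflow silently wraps, so the result is taken modulo 2^32.
    static int hash31(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "hello";
        System.out.println(hash31(s) == s.hashCode()); // true
    }
}
```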


        
13 Answers
  • 2020-11-22 02:05

    A big expectation from hash functions is that their result's uniform randomness survives an operation such as hash(x) % N where N is an arbitrary number (and in many cases, a power of two), one reason being that such operations are used commonly in hash tables for determining slots. Using prime number multipliers when computing the hash decreases the probability that your multiplier and the N share divisors, which would make the result of the operation less uniformly random.
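
    To make the shared-divisor point concrete, here is a small sketch under stated assumptions: `polyHash` and `slotsUsed` are hypothetical helpers, the table size is fixed at N = 32, and the keys are the 26 two-letter strings "aa", "ba", ..., "za". With multiplier 32 (which shares every factor with N) the first character drops out of `hash % N` entirely, while with 31 every key lands in its own slot.

```java
class SharedFactorDemo {
    // Polynomial string hash with a configurable multiplier (illustrative helper).
    static int polyHash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    // Counts how many distinct slots of an n-slot table are hit by the
    // keys "aa", "ba", ..., "za" (same last character, varying first one).
    static int slotsUsed(int multiplier, int n) {
        boolean[] used = new boolean[n];
        int count = 0;
        for (char first = 'a'; first <= 'z'; first++) {
            String key = first + "a";
            int slot = Math.floorMod(polyHash(key, multiplier), n);
            if (!used[slot]) {
                used[slot] = true;
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(slotsUsed(31, 32)); // 26
        System.out.println(slotsUsed(32, 32)); // 1
    }
}
```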

    Others have pointed out the nice property that multiplication by 31 can be done with a shift and a subtraction. I just want to point out that there is a mathematical term for such primes: Mersenne primes.

    All Mersenne primes are one less than a power of two, so we can write them as:

    p = 2^n - 1
    

    Multiplying x by p:

    x * p = x * (2^n - 1) = x * 2^n - x = (x << n) - x
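
    The identity also holds for Java's wrapping `int` arithmetic, which a quick check can confirm. This is only a sketch; `timesMersenne` is a name invented for the example.

```java
class MersenneMultiply {
    // Computes x * (2^n - 1) as a shift and a subtraction.
    // Both sides wrap modulo 2^32, so the identity survives overflow.
    static int timesMersenne(int x, int n) {
        return (x << n) - x;
    }

    public static void main(String[] args) {
        int x = 123456789;
        System.out.println(timesMersenne(x, 5) == 31 * x);  // true
        System.out.println(timesMersenne(x, 7) == 127 * x); // true
    }
}
```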
    

    Shifts (SAL/SHL) and subtractions (SUB) are generally faster than multiplications (MUL) on many machines. See the instruction tables from Agner Fog.

    That's why GCC seems to optimize multiplications by Mersenne primes by replacing them with shifts and subtractions, see here.

    However, in my opinion, such a small prime is a bad choice for a hash function. With a relatively good hash function, you would expect to have randomness at the higher bits of the hash. However, with the Java hash function, there is almost no randomness at the higher bits with shorter strings (and still highly questionable randomness at the lower bits). This makes it more difficult to build efficient hash tables. See this nice trick you couldn't do with the Java hash function.
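
    A quick way to see the missing high bits is to count how many bits a short string's hash actually occupies. The helper name below is made up for the illustration.

```java
class HighBitsDemo {
    // Number of bits needed to represent the hash value
    // (32 if the hash is negative, 0 if it is zero).
    static int significantBits(String s) {
        return 32 - Integer.numberOfLeadingZeros(s.hashCode());
    }

    public static void main(String[] args) {
        System.out.println(significantBits("zz"));  // 12
        System.out.println(significantBits("zzz")); // 17
    }
}
```

    For every lowercase string of up to three letters, the upper 15 bits of the hash are therefore always zero.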

    Some answers mention that they believe it is good that 31 fits into a byte. This is actually useless since:

    (1) We execute shifts instead of multiplications, so the size of the multiplier does not matter.

    (2) As far as I know, there is no specific x86 instruction to multiply an 8-byte value by a 1-byte value, so you would have needed to convert "31" to an 8-byte value anyway, even if you were multiplying. See here, you multiply entire 64-bit registers.

    (And 127 is actually the largest Mersenne prime that could fit in a byte.)

    Does a smaller value increase randomness in the middle-lower bits? Maybe, but it also seems to greatly increase the possible collisions :).
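
    A well-known illustration of how easily collisions arise: "Aa" and "BB" have the same hash, and so does any pair of strings built by concatenating those two blocks in the same positions. The `collide` helper is just a convenience for this sketch.

```java
class CollisionDemo {
    // True if a and b are different strings with the same hashCode().
    static boolean collide(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode());         // 2112
        System.out.println("BB".hashCode());         // 2112
        System.out.println(collide("AaBB", "BBAa")); // true
    }
}
```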

    One could list many different issues but they generally boil down to two core principles not being fulfilled well: Confusion and Diffusion

    But is it fast? Probably, since it doesn't do much. However, if performance is really the focus here, one character per loop is quite inefficient. Why not do 4 characters at a time (8 bytes) per loop iteration for longer strings, like this? Well, that would be difficult to do with the current definition of hash where you need to multiply every character individually (please tell me if there is a bit hack to solve this :D).
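
    Not a bit hack, but one algebraic possibility is to unroll the recurrence so that each iteration consumes four characters, using precomputed powers of 31. The sketch below is an assumption-laden illustration (the method name is invented, and no claim is made that it is faster in practice); it only shows that the result is unchanged.

```java
class UnrolledHash {
    // Same value as String.hashCode(), but four chars per loop iteration:
    // h' = 31^4*h + 31^3*c0 + 31^2*c1 + 31*c2 + c3 (all modulo 2^32).
    static int hashUnrolled(String s) {
        int h = 0;
        int i = 0;
        int n = s.length();
        for (; i + 3 < n; i += 4) {
            h = 923521 * h            // 31^4
              + 29791 * s.charAt(i)   // 31^3
              + 961 * s.charAt(i + 1) // 31^2
              + 31 * s.charAt(i + 2)
              + s.charAt(i + 3);
        }
        // Handle the remaining 0 to 3 characters one at a time.
        for (; i < n; i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "The quick brown fox";
        System.out.println(hashUnrolled(s) == s.hashCode()); // true
    }
}
```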

  • 2020-11-22 02:06

    Actually, 37 would work pretty well! z := 37 * x can be computed as y := x + 8 * x; z := x + 4 * y. Each step corresponds to a single x86 LEA instruction, so this is extremely fast.

    In fact, multiplication with the even-larger prime 73 could be done at the same speed by setting y := x + 8 * x; z := x + 8 * y.
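
    The two decompositions can be checked in Java, where each line of the form `a + k * b` with k in {2, 4, 8} mirrors what a single LEA can compute. The method names are invented for this sketch, and whether a JIT compiler actually emits LEA here is not something the example shows.

```java
class LeaMultiply {
    // 37 * x as two "base + scaled index" steps.
    static int times37(int x) {
        int y = x + 8 * x; // y = 9x
        return x + 4 * y;  // x + 36x = 37x
    }

    // 73 * x as two "base + scaled index" steps.
    static int times73(int x) {
        int y = x + 8 * x; // y = 9x
        return x + 8 * y;  // x + 72x = 73x
    }

    public static void main(String[] args) {
        int x = 987654321;
        System.out.println(times37(x) == 37 * x); // true
        System.out.println(times73(x) == 73 * x); // true
    }
}
```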

    Using 73 or 37 (instead of 31) might be better, because it leads to denser code: the two LEA instructions only take 6 bytes vs. the 7 bytes for move+shift+subtract for the multiplication by 31. One possible caveat is that the 3-argument LEA instructions used here became slower on Intel's Sandy Bridge architecture, with an increased latency of 3 cycles.

    Moreover, 73 is Sheldon Cooper's favorite number.

  • 2020-11-22 02:08

    Bloch doesn't quite go into this, but the rationale I've always heard/believed is that this is basic algebra. Hashes boil down to multiplication and modulus operations, which means that you never want to use numbers with common factors if you can help it. In other words, relatively prime numbers provide an even distribution of answers.

    The numbers involved in using a hash are typically:

    • modulus of the data type you put it into (2^32 or 2^64)
    • modulus of the bucket count in your hashtable (varies; in Java it used to be prime, now 2^n)
    • multiply or shift by a magic number in your mixing function
    • The input value

    You really only get to control a couple of these values, so a little extra care is due.
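
    As a rough sketch of the second modulus in that list: when the bucket count is a power of two, reducing the hash modulo the bucket count is the same as masking off the low bits, so only those bits decide the slot. `indexFor` is a hypothetical helper for this example; real hash table implementations may mix the bits first.

```java
class BucketIndex {
    // Slot for a table whose size n is a power of two: keeps only the low bits.
    static int indexFor(int hash, int n) {
        return hash & (n - 1);
    }

    public static void main(String[] args) {
        int h = "hello".hashCode();
        System.out.println(indexFor(h, 16) == Math.floorMod(h, 16)); // true
    }
}
```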

  • 2020-11-22 02:11

    According to Joshua Bloch's Effective Java (a book that can't be recommended enough, and which I bought thanks to continual mentions on Stack Overflow):

    The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

    (from Chapter 3, Item 9: Always override hashCode when you override equals, page 48)
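
    The information loss with an even multiplier can be observed directly. In this sketch (`hashWith` is a hypothetical helper), the multiplier 32 shifts the first character of an 8-character string completely out of a 32-bit `int`, so two strings that differ only in that character collide; with 31 they do not.

```java
class EvenMultiplierDemo {
    // Polynomial string hash with a configurable multiplier (illustrative helper).
    static int hashWith(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String a = "Xabcdefg";
        String b = "Yabcdefg";
        // With 32, the first char is multiplied by 32^7 = 2^35, which is 0 modulo 2^32.
        System.out.println(hashWith(a, 32) == hashWith(b, 32)); // true
        // With 31, the first char still influences the result.
        System.out.println(hashWith(a, 31) == hashWith(b, 31)); // false
    }
}
```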

  • 2020-11-22 02:16

    On (mostly) old processors, multiplying by 31 can be relatively cheap. On an ARM, for instance, it is only one instruction:

    RSB       r1, r0, r0, ASL #5    ; r1 := - r0 + (r0<<5)
    

    Most other processors would require a separate shift and subtract instruction. However, if your multiplier is slow this is still a win. Modern processors tend to have fast multipliers so it doesn't make much difference, so long as 32 goes on the correct side.

    It's not a great hash algorithm, but it's good enough and better than the 1.0 code (and very much better than the 1.0 spec!).

  • 2020-11-22 02:25

    You can read Bloch's original reasoning under "Comments" in http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622. He investigated the performance of different hash functions with regard to the resulting "average chain size" in a hash table. P(31) was one of the common functions at the time, which he found in K&R's book (but even Kernighan and Ritchie couldn't remember where it came from). In the end he basically had to choose one, so he took P(31) since it seemed to perform well enough. Even though P(33) was not really worse and multiplication by 33 is equally fast to calculate (just a shift by 5 and an addition), he opted for 31 since 33 is not a prime:

    Of the remaining four, I'd probably select P(31), as it's the cheapest to calculate on a RISC machine (because 31 is the difference of two powers of two). P(33) is similarly cheap to calculate, but it's performance is marginally worse, and 33 is composite, which makes me a bit nervous.

    So the reasoning was not as rational as many of the answers here seem to imply. But we're all good at coming up with rational reasons after gut decisions (and even Bloch might be prone to that).
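
    For anyone who wants to try a comparison of this kind, here is a rough sketch. Everything in it is an assumption made for illustration: the synthetic keys, the bucket count, and above all the metric (keys per non-empty bucket), which is only one possible reading of "average chain size" and is not taken from the bug report.

```java
import java.util.ArrayList;
import java.util.List;

class ChainSizeDemo {
    // P(m): polynomial string hash with multiplier m.
    static int polyHash(String s, int m) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = m * h + s.charAt(i);
        }
        return h;
    }

    // Average number of keys per non-empty bucket (assumed metric).
    static double averageChainSize(List<String> keys, int multiplier, int buckets) {
        int[] counts = new int[buckets];
        for (String k : keys) {
            counts[Math.floorMod(polyHash(k, multiplier), buckets)]++;
        }
        int nonEmpty = 0;
        for (int c : counts) {
            if (c > 0) nonEmpty++;
        }
        return nonEmpty == 0 ? 0.0 : (double) keys.size() / nonEmpty;
    }

    // Synthetic keys "key0", "key1", ...; real word lists will behave differently.
    static List<String> sampleKeys(int n) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            keys.add("key" + i);
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> keys = sampleKeys(10000);
        System.out.println("P(31): " + averageChainSize(keys, 31, 4096));
        System.out.println("P(33): " + averageChainSize(keys, 33, 4096));
    }
}
```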
