Why does Java's hashCode() in String use 31 as a multiplier?

星月不相逢 2020-11-22 01:34

Per the Java documentation, the hash code for a String object is computed as:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
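
In that formula `^` means exponentiation and the sum is taken in 32-bit `int` arithmetic, so overflow simply wraps. As a minimal sketch (the class and method names `PolyHash` and `polyHash` are made up for illustration, and this is not claimed to be the JDK's actual source), the same polynomial can be evaluated left to right with Horner's rule and compared against `String.hashCode()`:

```java
class PolyHash {
    // Horner's rule: ((s[0]*31 + s[1])*31 + s[2])*31 + ... + s[n-1]
    // int overflow wraps silently, which matches the documented "int arithmetic".
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String[] samples = {"", "a", "ab", "hello", "Why 31?"};
        for (String s : samples) {
            // Both columns should agree for every sample string.
            System.out.println("\"" + s + "\" -> " + polyHash(s)
                    + " (String.hashCode: " + s.hashCode() + ")");
        }
    }
}
```

For example, `"ab"` gives 97*31 + 98 = 3105.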
        
13 Answers
  •  -上瘾入骨i
    2020-11-22 02:32

    From JDK-4045622, where Joshua Bloch describes the reasons why that particular (new) String.hashCode() implementation was chosen:

    The table below summarizes the performance of the various hash functions described above, for three data sets:

    1) All of the words and phrases with entries in Merriam-Webster's 2nd Int'l Unabridged Dictionary (311,141 strings, avg length 10 chars).

    2) All of the strings in /bin/, /usr/bin/, /usr/lib/, /usr/ucb/ and /usr/openwin/bin/* (66,304 strings, avg length 21 characters).

    3) A list of URLs gathered by a web-crawler that ran for several hours last night (28,372 strings, avg length 49 characters).

    The performance metric shown in the table is the "average chain size" over all elements in the hash table (i.e., the expected value of the number of key compares to look up an element).

                              Webster's   Code Strings    URLs
                              ---------   ------------    ----
    Current Java Fn.          1.2509      1.2738          13.2560
    P(37)    [Java]           1.2508      1.2481          1.2454
    P(65599) [Aho et al]      1.2490      1.2510          1.2450
    P(31)    [K+R]            1.2500      1.2488          1.2425
    P(33)    [Torek]          1.2500      1.2500          1.2453
    Vo's Fn                   1.2487      1.2471          1.2462
    WAIS Fn                   1.2497      1.2519          1.2452
    Weinberger's Fn(MatPak)   6.5169      7.2142          30.6864
    Weinberger's Fn(24)       1.3222      1.2791          1.9732
    Weinberger's Fn(28)       1.2530      1.2506          1.2439
    

    Looking at this table, it's clear that all of the functions except for the current Java function and the two broken versions of Weinberger's function offer excellent, nearly indistinguishable performance. I strongly conjecture that this performance is essentially the "theoretical ideal", which is what you'd get if you used a true random number generator in place of a hash function.

    I'd rule out the WAIS function as its specification contains pages of random numbers, and its performance is no better than any of the far simpler functions. Any of the remaining six functions seem like excellent choices, but we have to pick one. I suppose I'd rule out Vo's variant and Weinberger's function because of their added complexity, albeit minor. Of the remaining four, I'd probably select P(31), as it's the cheapest to calculate on a RISC machine (because 31 is the difference of two powers of two). P(33) is similarly cheap to calculate, but its performance is marginally worse, and 33 is composite, which makes me a bit nervous.

    Josh
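
The "difference of two powers of two" remark can be made concrete: 31 = 2^5 - 2^0, so multiplying by 31 is one shift and one subtraction, while 33 = 2^5 + 2^0 is one shift and one addition. The sketch below (with hypothetical names `ShiftMultiply`, `times31`, `times33`) only checks that the identities hold in wrapping `int` arithmetic. Whether a given compiler or JIT actually performs this rewrite is an assumption, not something the quote establishes.

```java
class ShiftMultiply {
    // 31*h == 32*h - h == (h << 5) - h, modulo 2^32
    static int times31(int h) {
        return (h << 5) - h;
    }

    // 33*h == 32*h + h == (h << 5) + h, modulo 2^32
    static int times33(int h) {
        return (h << 5) + h;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int h : samples) {
            // Each line reports whether the shift form agrees with plain multiplication.
            System.out.println(h + ": 31 ok = " + (times31(h) == 31 * h)
                    + ", 33 ok = " + (times33(h) == 33 * h));
        }
    }
}
```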
