Probability of 64bit Hash Code Collisions

天涯浪人 2021-02-04 07:30

The book Numerical Recipes offers a method for calculating 64-bit hash codes in order to reduce the number of collisions.

The algorithm is shown at http://www.javamex.com/tu

4 answers
  •  有刺的猬
    2021-02-04 07:57

    1) Is there a formula to estimate the probability of collisions taking into account the so-called Birthday Paradox?

    The probability of a collision depends on how many keys are generated. Assuming the hash function is uniform, the probability that no collision occurs while generating k keys can be calculated as follows:

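    x = hash space size (number of possible hash values)
    p(k=2) = (x-1)/x
    p(k=3) = p(k=2)*(x-2)/x
    ..
    p(k=n) = (x-1)*(x-2)..(x-n+1)/x^(n-1)
    
    p(k=n) ~ e^(-n^2/(2x))
    
    p(collision|k=n) = 1 - p(k=n) ~ 1 - e^(-n^2/(2x))
    p(collision) ~ 0.39 if n = sqrt(x)
    p(collision) > 0.5 if n > sqrt(2*ln(2)*x) ~ 1.18*sqrt(x)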
    

    Hence if sqrt(2^64) = 2^32 keys are generated, the chance of at least one collision is already about 39%, and it passes 50% at roughly 1.18 * 2^32 (about 5.1 billion) keys.
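
The derivation above can be checked numerically. The following is a minimal sketch, not part of the original answer: the function names are made up for illustration, and a 16-bit hash space (x = 2^16) is assumed so that the exact product is cheap to evaluate. It compares the exact no-collision product with the exponential approximation.

```python
import math

def p_no_collision_exact(n: int, x: int) -> float:
    """Exact probability that n uniform hashes in a space of size x are all distinct.

    Evaluates the product (x-1)/x * (x-2)/x * ... * (x-n+1)/x in log space.
    """
    log_p = sum(math.log1p(-i / x) for i in range(1, n))
    return math.exp(log_p)

def p_collision_approx(n: int, x: int) -> float:
    """Birthday approximation 1 - e^(-n^2 / (2x))."""
    return -math.expm1(-(n * n) / (2 * x))

x = 2 ** 16  # assumed small hash space, for illustration only
for n in (100, 256, 302, 1000):
    exact = 1.0 - p_no_collision_exact(n, x)
    approx = p_collision_approx(n, x)
    print(f"n={n:5d}  exact={exact:.4f}  approx={approx:.4f}")
```

Here n = 256 is sqrt(x), and n = 302 is the first key count at which the approximation exceeds 0.5 for this hash space.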

    2) Can you estimate the probability of a collision (i.e two keys that hash to the same value)? Let's say with 1,000 keys and with 10,000 keys?

    x = 2^64
    Use the formula pc(k=n) = 1 - e^(-n^2/(2x)) ~ n^2/(2x) for n << sqrt(x)
    pc(k=1,000)  ~ 2.7e-14
    pc(k=10,000) ~ 2.7e-12
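
As a rough sketch of how that formula might be evaluated for the two key counts in the question (the function name is hypothetical, and a uniform 64-bit hash is assumed):

```python
import math

X = 2 ** 64  # size of the 64-bit hash space

def p_collision_64(n: int, x: int = X) -> float:
    """Approximate probability of at least one collision among n keys."""
    # expm1 keeps full precision when the exponent is tiny;
    # computing 1 - exp(...) directly would lose significant digits here.
    return -math.expm1(-(n * n) / (2 * x))

for n in (1_000, 10_000):
    print(f"n={n:>6}: p(collision) ~ {p_collision_64(n):.3e}")
```

Because the exponent is so small for these key counts, the result is practically equal to n^2/(2x), so ten times more keys means a probability about one hundred times higher.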
    

    3) Is it safe to assume that a collision of a reasonable number of keys (say, less than 10,000 keys) is so improbable so that if 2 hash codes are the same we can say that the keys are the same without any further checking?

    This is a very interesting question because the answer depends on the size of the key space. Suppose your keys are generated at random from a space of size s, and the hash space has size x = 2^64 as you mentioned. The probability of a hash collision is Pc(k=n|x) = 1 - e^(-n^2/(2x)). The probability of drawing the same key twice from the key space is P(k=n|s) = 1 - e^(-n^2/(2s)). For equal hashes to imply equal keys, a repeated key must be more likely than a hash collision:

    P(k=n|s) > Pc(k=n|x)
    1 - e^(-n^2/(2s)) > 1 - e^(-n^2/(2x))
    n^2/(2s) > n^2/(2x)
    s < x
    s < 2^64
    

    Hence, for equal hashes to imply equal keys, the key space must be smaller than about 2^64; otherwise a hash collision is more likely than a repeated key. The result is independent of the number of keys generated.
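
The comparison in point 3 can be illustrated with a short sketch. It assumes keys are drawn uniformly at random from a key space of size s, as the answer does; the helper name and the sample key-space sizes are chosen only for illustration.

```python
import math

def p_repeat(n: int, space: int) -> float:
    """Approximate probability that n uniform draws from `space` values contain a repeat."""
    return -math.expm1(-(n * n) / (2 * space))

x = 2 ** 64  # hash space size
for s in (2 ** 32, 2 ** 64, 2 ** 96):  # assumed key space sizes
    for n in (1_000, 10_000):
        p_key = p_repeat(n, s)   # same key drawn twice
        p_hash = p_repeat(n, x)  # two draws share a hash value
        verdict = "repeated key more likely" if p_key > p_hash else "hash collision at least as likely"
        print(f"s=2^{s.bit_length() - 1:<3} n={n:>6}  p_key={p_key:.3e}  p_hash={p_hash:.3e}  {verdict}")
```

For every n the verdict depends only on whether s is smaller than x, which matches the claim that the result does not depend on the number of keys.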
