Probability of 64bit Hash Code Collisions

后端未结

关注

 4  987

天涯浪人 2021-02-04 07:30

The book Numerical Recipes offers a method to calculate 64bit hash codes in order to reduce the number of collisions.

The algorithm is shown at http://www.javamex.com/tu

4条回答

有刺的猬 (楼主)

2021-02-04 07:57
1) Is there a formula to estimate the probability of collisions taking into account the so-called Birthday Paradox?

The probability of a single collision occurring depends on the key set generated as the hash function is uniform we can do following to calculate the probability that collision doesnt occurs at generation of k keys as follows :-
```
x = hash size
p(k=2) = (x-1)/x
p(k=3) = p(k=2)*(x-2)/x
..
p(k=n) = (x-1)*(x-2)..(x-n+1)/x^n

p(k=n) ~ e^-(n*n)/2x

p(collision|k=n) = 1-p(k=n) = 1 - e^(-n^2)/2x
p(collision) > 0.5 if n ~ sqrt(x)
```
Hence if sqrt(2^64) keys that is 2^32 key are generated there is higher chance that there is a single collision.

2) Can you estimate the probability of a collision (i.e two keys that hash to the same value)? Let's say with 1,000 keys and with 10,000 keys?
```
x = 2^64 
Use the formula pc(k=n) = 1 - e^-(n^2)/2x
```
3) Is it safe to assume that a collision of a reasonable number of keys (say, less than 10,000 keys) is so improbable so that if 2 hash codes are the same we can say that the keys are the same without any further checking?

This is a very interesting question because it depends on the size of key space. Suppose your keys are generated at random from space of size = s and hash space is x=2^64 as you mentioned. Probability of collision is Pc(k=n|x) = 1-e^(-n^2)/2x. If Probability of choosing same key in key space is P(k=n|s) = 1-e^(-n^2)/2s . For it to be sure that if hash is same then keys are same:-
```
P(k=n|s) > Pc(k=n|x)
1-e^-(n^2/2s) > 1-e^-(n^2/2x) 
n^2/2s > n^2/2x 
s < x
s < 2^64
```
Hence it shows that for keys to be same if hash is same that key set size must be small than 2^64 approx otherwise there is a chance of collision in hash more than in key set. The result is independent of number of keys generated.
0 讨论(0)

查看其它4个回答
发布评论:

提交评论
- 加载中...