The book Numerical Recipes offers a method for calculating 64-bit hash codes in order to reduce the number of collisions.
The algorithm is shown at http://www.javamex.com/tu
1) Is there a formula to estimate the probability of collisions taking into account the so-called Birthday Paradox?
The probability of a collision depends on the number of keys generated. Assuming the hash function is uniform, the probability that no collision occurs while generating k keys can be calculated as follows:
x = size of the hash space (number of possible hash values)
p(k=2) = (x-1)/x
p(k=3) = p(k=2)*(x-2)/x
..
p(k=n) = (x-1)*(x-2)*...*(x-n+1) / x^(n-1)
p(k=n) ~ e^(-n^2/(2x))
p(collision|k=n) = 1 - p(k=n) ~ 1 - e^(-n^2/(2x))
p(collision) reaches 0.5 when n ~ sqrt(2*ln(2)*x) ~ 1.18*sqrt(x)
Hence once on the order of sqrt(2^64) = 2^32 keys have been generated (about 5 billion for a 50% chance), at least one collision becomes likely.
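The derivation above can be checked numerically. The following is a minimal Python sketch, not part of the original answer; the function names are made up for illustration, and the small hash space x = 365 (the classic birthday case) is chosen only so that the exact product is cheap to evaluate.

```python
import math

def p_no_collision_exact(n, x):
    """Exact probability that n uniformly hashed keys all get distinct values out of x."""
    p = 1.0
    for i in range(1, n):
        p *= (x - i) / x  # the i-th extra key must avoid the i values already taken
    return p

def p_collision_approx(n, x):
    """Birthday approximation 1 - e^(-n^2/(2x)), reasonable when n is much smaller than x."""
    return -math.expm1(-n * n / (2.0 * x))

def keys_for_probability(p, x):
    """Approximate key count at which the collision probability reaches p (inverts the formula above)."""
    return math.sqrt(-2.0 * x * math.log1p(-p))

if __name__ == "__main__":
    print(1 - p_no_collision_exact(23, 365))  # exact value for the classic birthday case
    print(p_collision_approx(23, 365))        # approximation for the same case
    print(keys_for_probability(0.5, 2 ** 64)) # keys needed for a 50% chance with 64-bit hashes
```

The exact and approximate values agree to within about one percentage point even for this small x, and the agreement improves as x grows.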
2) Can you estimate the probability of a collision (i.e. two keys that hash to the same value)? Let's say with 1,000 keys and with 10,000 keys?
x = 2^64
Use the formula pc(k=n) = 1 - e^(-n^2/(2x)). For n = 1,000 this gives roughly 2.7 * 10^-14, and for n = 10,000 roughly 2.7 * 10^-12.
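As a rough illustration, the formula can be evaluated for both key counts in Python. This is only a sketch under the same uniform-hash assumption; HASH_SPACE and collision_probability are illustrative names, not anything taken from Numerical Recipes.

```python
import math

HASH_SPACE = 2 ** 64  # x: number of distinct 64-bit hash values

def collision_probability(n, x=HASH_SPACE):
    """Birthday approximation 1 - e^(-n^2/(2x)) for n keys in a hash space of size x."""
    # expm1 is used because the exponent is tiny here; computing 1 - exp(-t)
    # directly would lose precision to floating-point rounding.
    return -math.expm1(-n * n / (2.0 * x))

if __name__ == "__main__":
    for n in (1_000, 10_000):
        print(f"{n:>6} keys: p(collision) ~ {collision_probability(n):.3e}")
```

Because the exponent is so small, the result is practically equal to n^2/(2x), so multiplying the number of keys by 10 multiplies the collision probability by about 100.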
3) Is it safe to assume that a collision among a reasonable number of keys (say, fewer than 10,000 keys) is so improbable that if 2 hash codes are the same we can say that the keys are the same without any further checking?
This is a very interesting question because it depends on the size of the key space. Suppose your keys are generated at random from a space of size s, and the hash space is x = 2^64, as you mentioned. The probability of a hash collision is Pc(k=n|x) = 1 - e^(-n^2/(2x)). The probability of choosing the same key twice in the key space is P(k=n|s) = 1 - e^(-n^2/(2s)). For equal hashes to indicate equal keys, the following must hold:
P(k=n|s) > Pc(k=n|x)
1 - e^(-n^2/(2s)) > 1 - e^(-n^2/(2x))
n^2/(2s) > n^2/(2x)
s < x
s < 2^64
Hence, for equal hashes to indicate equal keys, the key space must be smaller than approximately 2^64; otherwise a collision in the hash is more likely than a repeated key in the key set. The result is independent of the number of keys generated.
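The comparison above can be sketched in Python as follows. This is an illustration only: it assumes keys are drawn uniformly at random from the key space, and the names p_repeat and hash_match_favours_same_key are hypothetical.

```python
import math

def p_repeat(n, space):
    """Birthday approximation: chance that n uniform draws from `space` values contain a repeat."""
    return -math.expm1(-n * n / (2.0 * space))

def hash_match_favours_same_key(n, s, x=2 ** 64):
    """True when a repeated key (key space s) is more likely than a hash collision (hash space x)."""
    return p_repeat(n, s) > p_repeat(n, x)

if __name__ == "__main__":
    # Key space smaller than the hash space: a repeated key is the likelier explanation.
    print(hash_match_favours_same_key(10_000, 2 ** 32))
    # Key space larger than the hash space: a hash collision is the likelier explanation.
    print(hash_match_favours_same_key(10_000, 2 ** 80))
```

Changing n in these calls does not change which side wins, as long as n is small enough that neither probability has rounded to 1.0 in floating point.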