How do I assess the hash collision probability?

前端未结


I'm developing a back-end application for a search system. The search system copies files to a temporary directory and gives them random names. Then it passes the temporary

5 answers

余生分开走

2020-11-27 03:45

I came up with a Monte Carlo approach to be able to sleep safely while using UUID for distributed systems that have to serialize without collisions.

from random import randrange
from math import log2
from collections import Counter

def colltest(exp):
    # Draw random values from a 2**exp keyspace until one repeats.
    uniques = set()
    while True:
        r = randrange(2**exp)
        if r in uniques:
            # log2 (rounded down) of the number of draws at the first collision
            return int(log2(len(uniques) + 1))
        uniques.add(r)

# Run 1000 trials in a 20-bit space and tally the results.
for k, v in sorted(Counter(colltest(20) for i in range(1000)).items()):
    print(k, "hash orders of magnitude events before collision:", v)

would print something like:

5 hash orders of magnitude events before collision: 1
6 hash orders of magnitude events before collision: 5
7 hash orders of magnitude events before collision: 21
8 hash orders of magnitude events before collision: 91
9 hash orders of magnitude events before collision: 274
10 hash orders of magnitude events before collision: 469
11 hash orders of magnitude events before collision: 138
12 hash orders of magnitude events before collision: 1

I had heard the rule of thumb before: if you need to store about 2**(x/2) keys, use a hash function whose keyspace is at least 2**x.

Repeated experiments show that across 1000 trials in a 20-bit space (x = 20), you sometimes get a collision as early as 2**(x/4) keys.

For uuid4, which has 122 random bits, that means I can sleep safely while several computers pick random UUIDs until I have about 2**31 items. Peak load in the system I am thinking about is roughly 10-20 events per second; I'm assuming an average of 7. That gives me an operating window of roughly 10 years, even with that extreme paranoia.
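
The arithmetic above can be checked with a short sketch. It uses the standard birthday approximation p ≈ 1 - exp(-n(n-1) / (2 * 2**bits)), which is not part of the original answer; the function name `collision_probability` is only illustrative, and the rate of 7 events per second is the assumption stated above.

```python
from math import expm1, log2

def collision_probability(n_items, bits):
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2**bits))."""
    space = 2 ** bits
    # expm1 keeps precision when the exponent is tiny.
    return -expm1(-n_items * (n_items - 1) / (2 * space))

SECONDS_PER_YEAR = 365 * 24 * 3600

n = 2 ** 31                          # item budget from the answer above
p = collision_probability(n, 122)    # 122 random bits in a version-4 UUID
years = n / 7 / SECONDS_PER_YEAR     # 7 events per second, as assumed above

print("collision probability at 2**31 items: about 2**%.1f" % log2(p))
print("years to reach 2**31 items at 7 events/s: %.1f" % years)
```

Under these assumptions the 2**(x/4) budget is far more conservative than the 2**(x/2) point where a collision becomes likely.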


刺人心

2020-11-27 03:47

Just because the probability is 1/X, it does not mean that it won't happen to you until you have X records. It's like the lottery: you're not likely to win, but somebody out there will.

With the speed and capacity of computers these days (not even talking about security, just reliability), there is really no reason not to use a bigger and better hash function than MD5 for anything critical. Stepping up to SHA-1 should help you sleep better at night, but if you want to be extra cautious, go to SHA-256 and never think about it again.

If performance is truly an issue, use BLAKE2, which is actually faster than MD5 but supports 256+ bits, making collisions less likely with the same or better performance. However, while BLAKE2 has been well adopted, it would probably require adding a new dependency to your project.
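
As a minimal sketch of switching hash functions, assuming Python 3.6 or later (where the standard `hashlib` module already includes BLAKE2, so no extra dependency is needed in that case): the helper name `file_digest` and the 1 MiB chunk size are illustrative choices, not something from the answer.

```python
import hashlib

def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files are never loaded whole."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Calling `file_digest(path)` gives a SHA-256 hex digest; `file_digest(path, "blake2b")` gives a BLAKE2b digest, which is 512 bits by default.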

别跟我提以往

2020-11-27 03:53

I think you shouldn't.

However, you should if two files with equal content can have different real names (not MD5-based ones). For example, in a search system two documents might have exactly the same content but still be distinct because they are located in different places.

旧时难觅i

2020-11-27 03:55
Equal hash means equal file, unless someone malicious is messing around with your files and injecting collisions (this could be the case if they are downloading stuff from the internet). If that is the case, go for a SHA-2 based function.

There are no accidental MD5 collisions; 1.47x10^-29 is a really, really, really small number.

To overcome the issue of rehashing big files, I would use a three-phase identity scheme:
1. File size alone
2. File size plus a hash of four 64K blocks taken at different positions in the file
3. A full hash
So if you see a file with a new size, you know for certain that you do not have a duplicate. If the size matches, you move on to the next phase, and so on.
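
The three phases above could be sketched roughly as follows. This is only an illustration: the function names, the use of SHA-256, and the evenly spaced sample offsets are assumptions, since the answer does not say which hash or which positions to use. A real system would also store the keys in an index rather than recompute them for every comparison.

```python
import hashlib
import os

BLOCK = 64 * 1024   # 64K sample size from the scheme above
SAMPLES = 4         # number of sampled blocks

def size_key(path):
    """Phase 1: file size alone."""
    return os.path.getsize(path)

def sample_key(path):
    """Phase 2: file size plus a hash of four sampled 64K blocks."""
    size = os.path.getsize(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for i in range(SAMPLES):
            # Evenly spaced offsets (an assumption); clamp for small files.
            offset = max(0, (size - BLOCK) * i // (SAMPLES - 1))
            f.seek(offset)
            h.update(f.read(BLOCK))
    return size, h.hexdigest()

def full_key(path):
    """Phase 3: hash of the whole file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def is_duplicate(path, known_paths):
    """Return True if `path` matches a known file in all three phases."""
    candidates = [p for p in known_paths if size_key(p) == size_key(path)]
    if not candidates:
        return False    # phase 1: new size, so it cannot be a duplicate
    key = sample_key(path)
    candidates = [p for p in candidates if sample_key(p) == key]
    if not candidates:
        return False    # phase 2: sampled blocks differ
    digest = full_key(path)
    return any(full_key(p) == digest for p in candidates)   # phase 3
```

The cheap checks run first, so the full hash is computed only for files that already match on size and sampled blocks.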
我在风中等你

2020-11-27 03:55

Here's an interactive calculator that lets you estimate the probability of a collision for any hash size and number of objects: http://everydayinternetstuff.com/2015/04/hash-collision-probability-calculator/
