Uniquely identifying URLs with one 64-bit number

前端 未结 5 1066
无人共我
无人共我 2021-02-09 05:12

This is basically a math problem, but very programing related: if I have 1 billion strings containing URLs, and I take the first 64 bits of the MD5 hash of each of them, what ki

5条回答
  •  [愿得一人]
    2021-02-09 05:39

    If the first 64 bits of the MD5 constituted a hash with ideal distribution, the birthday paradox would still mean you'd get collisions for every 2^32 URL's. In other words, the probability of a collision is the number of URL's divided by 4,294,967,296. See http://en.wikipedia.org/wiki/Birthday_paradox#Cast_as_a_collision_problem for details.

    I wouldn't feel comfortable just throwing away half the bits in MD5; it would be better to XOR the high and low 64-bit words to give them a chance to mix. Then again, MD5 is by no means fast or secure, so I wouldn't bother with it at all. If you want blinding speed with good distribution, but no pretence of security, you could try the 64-bit versions of MurmurHash. See http://en.wikipedia.org/wiki/MurmurHash for details and code.

提交回复
热议问题