Uniquely identifying URLs with one 64-bit number

闹比i 2021-02-09 05:10

This is basically a math problem, but very programming-related: if I have 1 billion strings containing URLs, and I take the first 64 bits of the MD5 hash of each of them, what ki

5 answers
  •  被撕碎了的回忆
    2021-02-09 05:26

    From what I see, you need a hash function that meets the following requirements:

    1. Hash arbitrary-length strings to a 64-bit value
    2. Be good, i.e. avoid collisions
    3. Not necessarily one-way (security not required)
    4. Preferably fast, which is a necessary characteristic for a non-security application
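To make the requirements above concrete, here is a minimal Python sketch of two functions that map an arbitrary-length string to a 64-bit value. `md5_first64` is the truncation described in the question; `fnv1a_64` is FNV-1a, swapped in here only as one example of a non-cryptographic hash, not as a recommendation from this answer. Both function names are made up for illustration, and the pure-Python FNV loop shows the algorithm rather than its real-world speed.

```python
import hashlib

# Standard 64-bit FNV-1a parameters.
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = (1 << 64) - 1


def md5_first64(url: str) -> int:
    """First 64 bits (8 bytes) of the MD5 digest, as an unsigned integer."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fnv1a_64(url: str) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the prime (mod 2**64)."""
    h = FNV64_OFFSET_BASIS
    for byte in url.encode("utf-8"):
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


if __name__ == "__main__":
    url = "http://example.com/some/page?id=42"  # hypothetical sample URL
    print(f"{md5_first64(url):016x}")
    print(f"{fnv1a_64(url):016x}")
```

Either value fits in an unsigned 64-bit column; which function collides less on real URLs is exactly what needs to be measured.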

    This hash function survey may be useful for drilling down to the function most suitable for you.
    I would suggest trying out several functions from that survey and characterizing them against your likely input set (pick a few billion URLs that you think you will see).

    You can add another column to the survey's results table for your own test URL list, and use it to characterize and choose among the existing hash functions or any new ones you want to check (extra rows in that table). The survey provides MSVC++ source code to start from (see its ZIP link).

    Changing the hash functions to suit your output width (64-bit) will give you a more accurate characterization for your application.
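As a rough sketch of that kind of characterization, the Python below counts collisions for a hash function over a URL sample and compares the result with the standard birthday approximation for an ideal hash. The helper names and the `example.com` URLs are hypothetical, and the approximation assumes uniformly distributed outputs, which is the property the measurement is meant to check.

```python
import hashlib
import math


def truncated_md5(url: str, bits: int = 64) -> int:
    """Top `bits` bits of the 128-bit MD5 digest of the URL."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (128 - bits)


def count_collisions(urls, hash_fn) -> int:
    """Distinct URLs minus distinct hash values, i.e. URLs that lost their slot."""
    unique_urls = set(urls)
    return len(unique_urls) - len({hash_fn(u) for u in unique_urls})


def expected_colliding_pairs(n: int, bits: int) -> float:
    """Birthday approximation: n*(n-1)/2 pairs, each colliding with chance 2**-bits."""
    return n * (n - 1) / 2 / 2 ** bits


def collision_probability(n: int, bits: int) -> float:
    """Approximate chance of at least one collision: 1 - exp(-expected pairs)."""
    return -math.expm1(-expected_colliding_pairs(n, bits))


if __name__ == "__main__":
    # Hypothetical sample; replace with URLs you actually expect to see.
    sample = [f"http://example.com/page/{i}" for i in range(2000)]
    # A narrow 16-bit width makes collisions visible on a small sample.
    observed = count_collisions(sample, lambda u: truncated_md5(u, 16))
    print("observed (16-bit):", observed)
    print("expected pairs (16-bit):", expected_colliding_pairs(len(sample), 16))
    print("P(any collision), 1e9 URLs, 64-bit:", collision_probability(10**9, 64))
```

Under the ideal-hash assumption, 1 billion URLs at 64 bits give about 0.027 expected colliding pairs, or roughly a 2.7% chance of at least one collision. A candidate function that does clearly worse than the approximation on your sample is distributing your URLs poorly.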
