substr md5 collision

前端 未结 3 1904
迷失自我
迷失自我 2021-01-18 10:40

I need a 4-character hash. At the moment I am taking the first 4 characters of a md5() hash. I am hashing a string which is 80 characters long or less. Will thi

相关标签:
3条回答
  • 2021-01-18 11:06

    Well, each character of md5 is a hex bit. That means it can have one of 16 possible values. So if you're only using the first 4 "hex-bits", that means you can have 16 * 16 * 16 * 16 or 16^4 or 65536 or 2^16 possibilities.

    So, that means that the total available "space" for results is only 16 bits wide. Now, according to the Birthday Attack/Problem, there are the following chances for collision:

    • 50% chance -> 300 entries
    • 1% chance -> 36 entries
    • 0.0000001% chance -> 2 entries.

    So there is quite a high chance for collisions.

    Now, you say you need a 4 character hash. Depending on the exact requirements, you can do:

    • 4 hex-bits for 16^4 (65,536) possible values
    • 4 alpha bits for 26^4 (456,976) possible values
    • 4 alpha numeric bits for 36^4 (1,679,616) possible values
    • 4 ascii printable bits for about 93^4 (74,805,201) possible values (assuming ASCII 33 -> 126)
    • 4 full bytes for 256^4 (4,294,967,296) possible values.

    Now, which you choose will depend on the actual use case. Does the hash need to be transmitted to a browser? How are you storing it, etc.

    I'll give an example of each (In PHP, but should be easy to translate / see what's going on):

    4 Hex-Bits:

    $hash = substr(md5($data), 0, 4);
    

    4 Alpha bits:

    $hash = substr(base_convert(md5($data), 16, 26)0, 4);
    $hash = str_replace(range(0, 9), range('S', 'Z'), $hash);
    

    4 Alpha Numeric bits:

    $hash = substr(base_convert(md5($data), 16, 36), 0, 4);
    

    4 Printable Assci Bits:

    $hash = hash('md5', $data, true); // We want the raw bytes
    $out = '';
    for ($i = 0; $i < 4; $i++) {
        $out .= chr((ord($hash[$i]) % 93) + 33);
    }
    

    4 full bytes:

    $hash = substr(hash('md5', $data, true), 0, 4); // We want the raw bytes
    
    0 讨论(0)
  • 2021-01-18 11:11

    4 first characters contains 4*4 = 16 bits of data, so collision will be definitely at 65536 elements, and, due to birthday attack, it will be found much faster. You should use more bits of hash.

    0 讨论(0)
  • 2021-01-18 11:21

    Surprisingly high indeed. As you can see from this graph of an approximate collision probability (formula from the wikipedia page), with just a few hundred elements your probability of having a collision is over 50%.

    Note, of course, if you're facing the possibility of an attacker providing the string, you can probably assume that it's 100% - scanning to find a collision in a 16-bit search space can be done almost instantaneously on any modern PC. Or even any modern cell phone, for that matter.

    0 讨论(0)
提交回复
热议问题