For each of our binary assets we generate a MD5 hash. This is used to check whether a certain binary asset is already in our application. But is it possible that two differe
Yes, it is possible. It is called a Hash collision.
Having said that, algorithms such as MD5 are designed to minimize the probability of a collision.
The Wikipedia entry on MD5 explains some vulnerabilities in MD5, which you should be aware of.
Yes, it is possible that two different strings can generate the same MD5 hash code.
Here is a simple test using very similar binary message in hex string:
$ echo '4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa200a8284bf36e8e4b55b35f427593d849676da0d1555d8360fb5f07fea2' | xxd -r -p | tee >/dev/null >(md5) >(sha1sum)
c6b384c4968b28812b676b49d40c09f8af4ed4cc -
008ee33a9d58b51cfeb425b0959121c9
$ echo '4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa202a8284bf36e8e4b55b35f427593d849676da0d1d55d8360fb5f07fea2' | xxd -r -p | tee >/dev/null >(md5) >(sha1sum)
c728d8d93091e9c7b87b43d9e33829379231d7ca -
008ee33a9d58b51cfeb425b0959121c9
They generate different SHA-1 sum, but the same MD5 hash value. Secondly the strings are very similar, so it's difficult to find the difference between them.
The difference can be found by the following command:
$ diff -u <(echo 4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa200a8284bf36e8e4b55b35f427593d849676da0d1555d8360fb5f07fea2 | fold -w2) <(echo 4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa202a8284bf36e8e4b55b35f427593d849676da0d1d55d8360fb5f07fea2 | fold -w2)
--- /dev/fd/63 2016-02-05 12:55:04.000000000 +0000
+++ /dev/fd/62 2016-02-05 12:55:04.000000000 +0000
@@ -33,7 +33,7 @@
af
bf
a2
-00
+02
a8
28
4b
@@ -53,7 +53,7 @@
6d
a0
d1
-55
+d5
5d
83
60
Above collision example is taken from Marc Stevens: Single-block collision for MD5, 2012; he explains his method, with source code (alternate link to the paper).
Another test:
$ echo '0e306561559aa787d00bc6f70bbdfe3404cf03659e704f8534c00ffb659c4c8740cc942feb2da115a3f4155cbb8607497386656d7d1f34a42059d78f5a8dd1ef' | xxd -r -p | tee >/dev/null >(md5) >(sha1sum)
756f3044edf52611a51a8fa7ec8f95e273f21f82 -
cee9a457e790cf20d4bdaa6d69f01e41
$ echo '0e306561559aa787d00bc6f70bbdfe3404cf03659e744f8534c00ffb659c4c8740cc942feb2da115a3f415dcbb8607497386656d7d1f34a42059d78f5a8dd1ef' | xxd -r -p | tee >/dev/null >(md5) >(sha1sum)
6d5294e385f50c12745a4d901285ddbffd3842cb -
cee9a457e790cf20d4bdaa6d69f01e41
Different SHA-1 sum, the same MD5 hash.
Difference is in one byte:
$ diff -u <(echo 0e306561559aa787d00bc6f70bbdfe3404cf03659e704f8534c00ffb659c4c8740cc942feb2da115a3f4155cbb8607497386656d7d1f34a42059d78f5a8dd1ef | fold -w2) <(echo 0e306561559aa787d00bc6f70bbdfe3404cf03659e744f8534c00ffb659c4c8740cc942feb2da115a3f415dcbb8607497386656d7d1f34a42059d78f5a8dd1ef | fold -w2)
--- /dev/fd/63 2016-02-05 12:56:43.000000000 +0000
+++ /dev/fd/62 2016-02-05 12:56:43.000000000 +0000
@@ -19,7 +19,7 @@
03
65
9e
-70
+74
4f
85
34
@@ -41,7 +41,7 @@
a3
f4
15
-5c
+dc
bb
86
07
Above example is adapted from Tao Xie and Dengguo Feng: Construct MD5 Collisions Using Just A Single Block Of Message, 2010.
Related:
I realize this is old, but thought I would contribute my solution. There are a 2^128 possible hash combinations. And thus a 2^64 probability of a birthday paradox. While the solution below won't eliminate possibility of collisions, it surely will reduce the risk by a very substantial amount.
2^64 = 18,446,744,073,709,500,000 possible combinations
What I have done is I put a few hashes together based on the input string to get a much longer resulting string that you consider your hash...
So my pseudo-code for this is:
Result = Hash(string) & Hash(Reverse(string)) & Hash(Length(string))
That is to practical improbability of a collision. But if you want to be super paranoid and can't have it happen, and storage space is not an issue (nor is computing cycles)...
Result = Hash(string) & Hash(Reverse(string)) & Hash(Length(string))
& Hash(Reverse(SpellOutLengthWithWords(Length(string))))
& Hash(Rotate13(string)) Hash(Hash(string)) & Hash(Reverse(Hash(string)))
Okay, not the cleanest solution, but this now gets you a lot more play with how infrequently you will run into a collision. To the point I might assume impossibility in all realistic senses of the term.
For my sake, I think the possibility of a collision is infrequent enough that I will consider this not "surefire" but so unlikely to happen that it suits the need.
Now the possible combinations goes up significantly. While you could spend a long time on how many combinations this could get you, I will say in theory it lands you SIGNIFICANTLY more than the quoted number above of
2^64 (or 18,446,744,073,709,551,616)
Likely by a hundred more digits or so. The theoretical max this could give you would be
Possible number of resulting strings:
528294531135665246352339784916516606518847326036121522127960709026673902556724859474417255887657187894674394993257128678882347559502685537250538978462939576908386683999005084168731517676426441053024232908211188404148028292751561738838396898767036476489538580897737998336
MD5 is a hash function – so yes, two different strings can absolutely generate colliding MD5 codes.
In particular, note that MD5 codes have a fixed length so the possible number of MD5 codes is limited. The number of strings (of any length), however, is definitely unlimited so it logically follows that there must be collisions.
Yes, it is possible. This is in fact a Birthday problem. However the probability of two randomly chosen strings having the same MD5 hash is very low.
See this and this questions for examples.