For each of our binary assets we generate a MD5 hash. This is used to check whether a certain binary asset is already in our application. But is it possible that two differe
Yes, it is! Collision will be a possibility (although, the risk is very small). If not, you would have a pretty effective compression method!
EDIT: As Konrad Rudolph says: A potentially unlimited set of input converted to a finite set of output (32 hex chars) will results in an endless number of collisions.
For a set of even billions of assets, the chances of random collisions are negligibly small -- nothing that you should worry about. Considering the birthday paradox, given a set of 2^64 (or 18,446,744,073,709,551,616) assets, the probability of a single MD5 collision within this set is 50%. At this scale, you'd probably beat Google in terms of storage capacity.
However, because the MD5 hash function has been broken (it's vulnerable to a collision attack), any determined attacker can produce 2 colliding assets in a matter of seconds worth of CPU power. So if you want to use MD5, make sure that such an attacker would not compromise the security of your application!
Also, consider the ramifications if an attacker could forge a collision to an existing asset in your database. While there are no such known attacks (preimage attacks) against MD5 (as of 2011), it could become possible by extending the current research on collision attacks.
If these turn out to be a problem, I suggest looking at the SHA-2 series of hash functions (SHA-256, SHA-384 and SHA-512). The downside is that it's slightly slower and has longer hash output.
Just to be more informative.
From a math point of view, Hash functions are not injective.
It means that there is not a 1 to 1 (but one way) relationship between the starting set and the resulting one.
Bijection on wikipedia
EDIT: to be complete injective hash functions exist: it's called Perfect hashing.
Yes, of course: MD5 hashes have a finite length, but there are an infinite number of possible character strings that can be MD5-hashed.
I think we need to be careful choosing the hashing algorithm as per our requirement, as hash collisions are not as rare as I expected. I recently found a very simple case of hash collision in my project. I am using Python wrapper of xxhash for hashing. Link: https://github.com/ewencp/pyhashxx
s1 = 'mdsAnalysisResult105588'
s2 = 'mdsAlertCompleteResult360224'
pyhashxx.hashxx(s1) # Out: 2535747266
pyhashxx.hashxx(s2) # Out: 2535747266
It caused a very tricky caching issue in the system, then I finally found that it's a hash collision.
As other people have said, yes, there can be collisions between two different inputs. However, in your use case, I don't see that being a problem. I highly doubt that you will run into collisions - I've used MD5 for fingerprinting hundreds of thousands of image files of a number of image (JPG, bitmap, PNG, raw) formats at a previous job and I didn't have a collision.
However, if you are trying to fingerprint some kind of data, perhaps you could use two hash algorithms - the odds of one input resulting in the same output of two different algorithms is near impossible.