Given these two images from Twitter:
http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg
http://a1.twimg.com/profile_images/58079916/lowres_pr
You said:
I don't want a cryptographic algorithm, as this needs to be a performant operation.
Well, I understand your need for speed, but I think you need to consider the drawbacks of that approach. If you just need to create hashes for URLs, you should stick with an existing algorithm rather than writing a new one, where you would have to deal with collisions yourself, for instance.
You could keep a Dictionary<string, string> as a cache for your URLs: when you get a new address, first look it up in that dictionary and, if you don't find a match, hash it and store it for future use.
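Something along these lines, as a rough sketch (the UrlHashCache and GetOrAdd names are just illustrative; the hashing delegate can be the HashIt method from the MD5 example below):

using System;
using System.Collections.Generic;

public class UrlHashCache
{
    private readonly Dictionary<string, string> cache =
        new Dictionary<string, string>();

    // Returns the cached hash for a URL, computing and storing it on a miss.
    public string GetOrAdd(string url, Func<string, string> hashIt)
    {
        string hash;
        if (!cache.TryGetValue(url, out hash))
        {
            hash = hashIt(url);   // only hash on a cache miss
            cache[url] = hash;    // remember it for next time
        }
        return hash;
    }
}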
For the hash function itself, you could give MD5 a try:
using System;
using System.Security.Cryptography;
using System.Text;

public class Program
{
    public static void Main(string[] args)
    {
        foreach (string url in new string[] {
            "http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg",
            "http://a1.twimg.com/profile_images/58079916/lowres_profilepic.jpg" })
        {
            Console.WriteLine(HashIt(url));
        }
    }

    // Resolving "." against the URL strips the file name, so only the
    // directory part is hashed; the digest is returned as Base64.
    private static string HashIt(string url)
    {
        Uri path = new Uri(new Uri(url), ".");
        MD5CryptoServiceProvider md5 = new MD5CryptoServiceProvider();
        byte[] data = md5.ComputeHash(
            Encoding.ASCII.GetBytes(path.OriginalString));
        return Convert.ToBase64String(data);
    }
}
You'll get:
rEoztCAXVyy0AP/6H7w3TQ==
0idVyXLs6sCP/XLBXwtCXA==
One of the key concepts of a URL is that it is unique. Why not use it?
Every algorithm that shortens the information can produce collisions. They may be unlikely, but they are possible nevertheless.
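If you take that route, a reversible, filesystem-safe encoding of the whole URL is enough. This is just a sketch of the idea (the UrlKey name is illustrative), trading longer filenames for zero collisions:

using System;
using System.Text;

public static class UrlKey
{
    // Base64 with '/' and '+' swapped for filename-safe characters,
    // so the original URL can always be recovered from the name.
    public static string Encode(string url)
    {
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(url))
                      .Replace('/', '_')
                      .Replace('+', '-');
    }

    public static string Decode(string name)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(
            name.Replace('_', '/').Replace('-', '+')));
    }
}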
While CRC32 produces a maximum of 2^32 values regardless of your input, and so will not avoid conflicts entirely, it is still a viable option for this scenario.
It is fast, so if you generate a filename that conflicts, just add or change a character in your URL and simply recalculate the CRC.
4.3 billion possible checksums mean the likelihood of a filename conflict, when combined with the original filename, is going to be so low as to be unimportant in normal situations.
I've used this approach myself for something similar and was pleased with the performance. See Fast CRC32 in Software.
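A minimal sketch of that idea (the UrlCrc and ToFileName names are just for illustration; a table-driven or library CRC32 would be faster in practice, as the linked article shows):

using System;
using System.Text;

public static class UrlCrc
{
    // Simple bitwise CRC32 (IEEE polynomial, reflected form).
    private static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int i = 0; i < 8; i++)
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }

    // Checksum of the URL combined with the original file name, as described above.
    public static string ToFileName(string url)
    {
        uint checksum = Crc32(Encoding.UTF8.GetBytes(url));
        string original = url.Substring(url.LastIndexOf('/') + 1);
        return checksum.ToString("x8") + "_" + original;
    }
}

For the Twitter URLs above this produces names of the form "<8 hex digits>_lowres_profilepic.jpg".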
Irrespective of how you do it (hashing, encoding, database lookup), I recommend that you don't try to map a huge number of URLs to files in one big flat directory.
The reason is that file lookup in most file systems involves a linear scan through the filenames in a directory. So if all N of your files are in one directory, a lookup will involve N/2 comparisons on average; i.e. it is O(N).
(Note that ReiserFS organizes the names in a directory as a BTree. However, ReiserFS seems to be the exception rather than the rule.)
Instead of one big flat directory, it would be better to map the URIs to a tree of directories. Depending on the shape of the tree, lookup can be as good as O(log N). For example, if you organized the tree so that it had 3 levels of directory with at most 100 entries in each directory, you could accommodate 1 million URLs. If you designed the mapping to use 2 character filenames, each directory should easily fit into a single disk block, and a pathname lookup (assuming that the required directories are already cached) should take a few microseconds.
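A rough sketch of such a mapping, assuming an MD5 hash rendered as hex (which, unlike Base64, contains only filesystem-safe characters); using 2 hex characters per level gives up to 256 entries per directory rather than 100, and the UrlToPath/ToTreePath names are just illustrative:

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class UrlToPath
{
    // Splits the hex digest into three 2-character directory levels
    // plus a filename, e.g. "ab/cd/ef/0123....jpg".
    public static string ToTreePath(string url)
    {
        using (MD5 md5 = MD5.Create())
        {
            string hex = BitConverter.ToString(
                md5.ComputeHash(Encoding.UTF8.GetBytes(url))).Replace("-", "");
            return Path.Combine(hex.Substring(0, 2),
                                hex.Substring(2, 2),
                                hex.Substring(4, 2),
                                hex.Substring(6) + ".jpg");
        }
    }
}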
I see your question is really about which hash algorithm is best for this purpose. You might want to check Best hashing algorithm in terms of hash collisions and performance for strings.
You can use Java's UUID class to generate a UUID from any byte sequence. It is unique for a given input, so you won't have a problem with file lookup:
String url = "http://www.google.com";
// nameUUIDFromBytes (java.util.UUID) returns a version-3, MD5-based UUID,
// so the same URL always maps to the same identifier.
String shortUrl = UUID.nameUUIDFromBytes(url.getBytes()).toString();