Given these two images from Twitter:
http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg
http://a1.twimg.com/profile_images/58079916/lowres_pr
You said:
I don't want a cryptographic algorithm, as this needs to be a performant operation.
Well, I understand your need for speed, but I think you need to consider the drawbacks of that approach. If you just need to create hashes for URLs, you should stick with an existing algorithm rather than writing a new one, where you would have to deal with collisions yourself, for instance.
You could keep a Dictionary<string, string> as a cache for your URLs: when you get a new address, first look it up in that dictionary and, if you don't find a match, hash it and store it for future use.
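Something along these lines, as a rough sketch (the UrlHashCache and GetOrAdd names are just illustrative; the hashing delegate can be the HashIt method from the MD5 example below):

using System;
using System.Collections.Generic;

public class UrlHashCache
{
    private readonly Dictionary<string, string> cache =
        new Dictionary<string, string>();

    // Returns the cached hash for a URL, computing and storing it on a miss.
    public string GetOrAdd(string url, Func<string, string> hashIt)
    {
        string hash;
        if (!cache.TryGetValue(url, out hash))
        {
            hash = hashIt(url);   // only hash on a cache miss
            cache[url] = hash;    // remember it for next time
        }
        return hash;
    }
}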
For the hash function itself, you could give MD5 a try:
using System;
using System.Security.Cryptography;
using System.Text;

public class Program
{
    public static void Main(string[] args)
    {
        foreach (string url in new string[] {
            "http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg",
            "http://a1.twimg.com/profile_images/58079916/lowres_profilepic.jpg" })
        {
            Console.WriteLine(HashIt(url));
        }
    }

    // Resolving "." against the URL strips the file name, so only the
    // directory part is hashed; the digest is returned as Base64.
    private static string HashIt(string url)
    {
        Uri path = new Uri(new Uri(url), ".");
        MD5CryptoServiceProvider md5 = new MD5CryptoServiceProvider();
        byte[] data = md5.ComputeHash(
            Encoding.ASCII.GetBytes(path.OriginalString));
        return Convert.ToBase64String(data);
    }
}
You'll get:
rEoztCAXVyy0AP/6H7w3TQ==
0idVyXLs6sCP/XLBXwtCXA==
One of the key concepts of a URL is that it is unique. Why not use it?
Every algorithm that shortens the information can produce collisions. They may be unlikely, but they are possible nevertheless.
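If you take that route, a reversible, filesystem-safe encoding of the whole URL is enough. This is just a sketch of the idea (the UrlKey name is illustrative), trading longer filenames for zero collisions:

using System;
using System.Text;

public static class UrlKey
{
    // Base64 with '/' and '+' swapped for filename-safe characters,
    // so the original URL can always be recovered from the name.
    public static string Encode(string url)
    {
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(url))
                      .Replace('/', '_')
                      .Replace('+', '-');
    }

    public static string Decode(string name)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(
            name.Replace('_', '/').Replace('-', '+')));
    }
}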
While CRC32 produces a maximum of 2^32 values regardless of your input, and so will not avoid conflicts entirely, it is still a viable option for this scenario.
It is fast, so if you generate a filename that conflicts, just add or change a character in your URL and simply recalculate the CRC.
4.3 billion possible checksums mean the likelihood of a filename conflict, when combined with the original filename, is going to be so low as to be unimportant in normal situations.
I've used this approach myself for something similar and was pleased with the performance. See Fast CRC32 in Software.
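A minimal sketch of that idea (the UrlCrc and ToFileName names are just for illustration; a table-driven or library CRC32 would be faster in practice, as the linked article shows):

using System;
using System.Text;

public static class UrlCrc
{
    // Simple bitwise CRC32 (IEEE polynomial, reflected form).
    private static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int i = 0; i < 8; i++)
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }

    // Checksum of the URL combined with the original file name, as described above.
    public static string ToFileName(string url)
    {
        uint checksum = Crc32(Encoding.UTF8.GetBytes(url));
        string original = url.Substring(url.LastIndexOf('/') + 1);
        return checksum.ToString("x8") + "_" + original;
    }
}

For the Twitter URLs above this produces names of the form "<8 hex digits>_lowres_profilepic.jpg".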
Irrespective of how you do it (hashing, encoding, database lookup), I recommend that you don't try to map a huge number of URLs to files in one big flat directory.
The reason is that file lookup in most file systems involves a linear scan through the filenames in a directory. So if all N of your files are in one directory, a lookup will involve N/2 comparisons on average; i.e. it is O(N).
(Note that ReiserFS organizes the names in a directory as a BTree. However, ReiserFS seems to be the exception rather than the rule.)
Instead of one big flat directory, it would be better to map the URIs to a tree of directories. Depending on the shape of the tree, lookup can be as good as O(log N). For example, if you organized the tree so that it had 3 levels of directory with at most 100 entries in each directory, you could accommodate 1 million URLs. If you designed the mapping to use 2 character filenames, each directory should easily fit into a single disk block, and a pathname lookup (assuming that the required directories are already cached) should take a few microseconds.
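A rough sketch of such a mapping, assuming an MD5 hash rendered as hex (which, unlike Base64, contains only filesystem-safe characters); using 2 hex characters per level gives up to 256 entries per directory rather than 100, and the UrlToPath/ToTreePath names are just illustrative:

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class UrlToPath
{
    // Splits the hex digest into three 2-character directory levels
    // plus a filename, e.g. "ab/cd/ef/0123....jpg".
    public static string ToTreePath(string url)
    {
        using (MD5 md5 = MD5.Create())
        {
            string hex = BitConverter.ToString(
                md5.ComputeHash(Encoding.UTF8.GetBytes(url))).Replace("-", "");
            return Path.Combine(hex.Substring(0, 2),
                                hex.Substring(2, 2),
                                hex.Substring(4, 2),
                                hex.Substring(6) + ".jpg");
        }
    }
}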
I see your question is really about which hash algorithm is best for this purpose. You might want to check Best hashing algorithm in terms of hash collisions and performance for strings.
You can use Java's UUID class to generate a UUID from any byte sequence. It is unique for a given input, so you won't have a problem with file lookup:
String url = "http://www.google.com";
// nameUUIDFromBytes (java.util.UUID) returns a version-3, MD5-based UUID,
// so the same URL always maps to the same identifier.
String shortUrl = UUID.nameUUIDFromBytes(url.getBytes()).toString();