How to generate a unique hash for a URL?

时光取名叫无心 2020-12-28 08:11

Given these two images from Twitter:

http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg
http://a1.twimg.com/profile_images/58079916/lowres_pr         


        
12 answers
  • 2020-12-28 08:44

    Git's content-addressed storage is based on SHA-1 because the chance of a collision is very small.

    If it is good enough for Git, it will be good enough for you too.
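
    As a rough illustration of this suggestion, the sketch below hashes a URL with SHA-1 and uses the hex digest as a name. The function name `url_to_sha1_name` is made up for this example, and encoding the URL as UTF-8 before hashing is an assumption.

```python
import hashlib


def url_to_sha1_name(url: str) -> str:
    """Return the 40-character hexadecimal SHA-1 digest of the URL."""
    # hashlib works on bytes, so the URL is encoded first (UTF-8 assumed).
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


# The same URL always produces the same 40-character name.
name = url_to_sha1_name(
    "http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg"
)
print(name)
```

    The digest contains only the characters 0-9 and a-f, so it is safe to use as a filename on common filesystems.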

  • 2020-12-28 08:44

    It appears that the numerical part of twimg.com URLs is already a unique value for each image. My research indicates that the number is sequential (i.e. the example URL below is for the 433,484,366th profile image ever uploaded, which just happens to be mine). Thus, this number is unique. My solution would be to simply use the numerical part of the filename as the "hash value", with no fear of ever finding a non-unique value.

    • URL: http://a2.twimg.com/profile_images/433484366/terrorbite-industries-256.png
    • Filename: 433484366.terrorbite-industries-256.png
    • Unique ID: 433484366

    I already use this system in a Python script that displays notifications for new tweets; as part of its operation, it caches profile image thumbnails to reduce unnecessary downloads.

    P.S. It makes no difference which subdomain the image is downloaded from; all images are available from all subdomains.
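
    A minimal sketch of extracting that numeric ID is shown here. It is not the script mentioned above; the function name `twimg_image_id` and the assumption that the path always looks like `/profile_images/<digits>/<filename>` are mine.

```python
import re
from urllib.parse import urlparse

# Assumed path layout: /profile_images/<numeric id>/<filename>
_ID_PATTERN = re.compile(r"^/profile_images/(\d+)/")


def twimg_image_id(url: str):
    """Return the numeric image ID from a twimg.com profile image URL, or None."""
    match = _ID_PATTERN.match(urlparse(url).path)
    return int(match.group(1)) if match else None


url = "http://a2.twimg.com/profile_images/433484366/terrorbite-industries-256.png"
print(twimg_image_id(url))  # 433484366
```

    Because the subdomain is ignored, the same image fetched from a1 or a2 maps to the same ID.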

  • 2020-12-28 08:47

    The nature of a hash is that it may result in collisions. How about one of these alternatives:

    1. Use a directory tree. Literally create subdirectories for each component of the URL.
    2. Generate a unique ID. The problem here is how to keep the mapping between the real name and the saved ID. You could use a database that maps each URL to a generated unique ID: insert a record into a table that generates unique IDs, and then use that ID as the filename.
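
    The second alternative could be sketched roughly as follows, using SQLite from the Python standard library. The table name `url_ids`, the helper names, and the in-memory database are all assumptions made for the example.

```python
import sqlite3


def open_mapping(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a database that maps each URL to a generated integer ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS url_ids ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "url TEXT NOT NULL UNIQUE)"
    )
    return conn


def id_for_url(conn: sqlite3.Connection, url: str) -> int:
    """Return the ID already stored for the URL, inserting a new row if needed."""
    row = conn.execute("SELECT id FROM url_ids WHERE url = ?", (url,)).fetchone()
    if row is not None:
        return row[0]
    with conn:  # commits the insert on success
        cursor = conn.execute("INSERT INTO url_ids (url) VALUES (?)", (url,))
    return cursor.lastrowid


conn = open_mapping()
image_id = id_for_url(conn, "http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg")
print(f"{image_id}.jpg")
```

    The database guarantees uniqueness, at the cost of a lookup for every URL.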
  • 2020-12-28 08:51

    A very simple approach:

    f( "http://a3.twimg.com/profile_images/130500759/" ) = a3_130500759.jpg
    f( "http://a1.twimg.com/profile_images/58079916/" )  = a1_58079916.jpg
    

    As the other parts of this URL are constant, you can combine the subdomain and the numeric part of the URL path into a unique filename.

    I don't see what problem this solution could have.
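
    One possible way to write the function f above is sketched here. The name `simple_filename` is hypothetical, and the code assumes the path has the form `/profile_images/<id>/<filename>`, taking the extension from the original filename when one is present.

```python
import posixpath
from urllib.parse import urlparse


def simple_filename(url: str) -> str:
    """Build '<subdomain>_<id><ext>' from a twimg.com profile image URL."""
    parts = urlparse(url)
    subdomain = parts.hostname.split(".")[0]  # e.g. "a3"
    # Non-empty path segments, e.g. ["profile_images", "130500759", "lowres_profilepic.jpg"]
    segments = [s for s in parts.path.split("/") if s]
    image_id = segments[1]
    # Extension of the last segment; empty if the URL stops at the ID.
    ext = posixpath.splitext(segments[-1])[1]
    return f"{subdomain}_{image_id}{ext}"


print(simple_filename("http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg"))
# a3_130500759.jpg
```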

  • 2020-12-28 08:53

    It seems what you really want is to have a legal filename that won't collide with others.

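    • Any reversible encoding of the URL that produces only filename-safe characters will work, even Base64 (in its URL-safe variant, since standard Base64 output can contain "/"): e.g. filename = base64(url)
    • A cryptographic hash will give you what you want. Although you claim this will be a performance bottleneck, don't be sure until you've benchmarked it.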
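
    The encoding option might look like the sketch below. It uses Python's URL-safe Base64 functions, because the standard Base64 alphabet includes "/", which is not allowed in filenames; the helper names are made up for the example.

```python
import base64


def url_to_filename(url: str) -> str:
    """Encode a URL as URL-safe Base64 (uses '-' and '_' instead of '+' and '/')."""
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")


def filename_to_url(name: str) -> str:
    """Recover the original URL from a name produced by url_to_filename."""
    return base64.urlsafe_b64decode(name.encode("ascii")).decode("utf-8")


url = "http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg"
name = url_to_filename(url)
print(name)
print(filename_to_url(name) == url)  # True
```

    Unlike a hash, this encoding is reversible and cannot collide, but the name grows with the URL, so very long URLs may exceed filesystem filename limits.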
  • 2020-12-28 08:57

    I'm playing with thumbalizr using a modified version of their caching script, and I think it has a few good solutions. The code is on github.com/mptre/thumbalizr, but the short version is that it uses MD5 to build the file names, then takes the first two characters of the file name and creates a folder with exactly that name. This makes it easy to split the files across folders, and fast to find the right folder without a database. Its simplicity kind of blew my mind.

    It generates file names like this http://pappmaskin.no/opensource/delicious_snapcasa/mptre-thumbalizr/cache/fc/fcc3a328e0f4c1b51bf5e13747614e7a_1280_1024_8_90_250.png

    The last part, _1280_1024_8_90_250, matches the different settings that the script uses when talking to the thumbalizr API, but I guess fcc3a328e0f4c1b51bf5e13747614e7a is a straight MD5 of the URL, in this case for thumbalizr.com.

    I tried changing the config to generate images 200px wide, and that image goes in the same folder, but instead of _250.png it is called _200.png.

    I haven't had time to dig that much in the code, but I'm sure it could be pulled apart from the thumbalizr logic and made more generic.
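
    A generic version of that layout might look like the sketch below. This is not the thumbalizr script's code; the function name `cache_path`, the default `cache` directory, and the `.png` extension are assumptions.

```python
import hashlib
import os


def cache_path(url: str, cache_dir: str = "cache", ext: str = ".png") -> str:
    """Return '<cache_dir>/<first two hex chars>/<md5 hex digest><ext>' for a URL."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    # The first two hex characters name the subfolder, giving up to 256 folders.
    return os.path.join(cache_dir, digest[:2], digest + ext)


path = cache_path("http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg")
os.makedirs(os.path.dirname(path), exist_ok=True)  # create the shard folder on demand
print(path)
```

    Since the folder name is derived from the file name itself, locating a cached file needs no database or directory scan.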
