Image URL Naming Scheme

Question

Prologue: I'm building a sort of CMS/social networking service that will host many images.

I intend to use Eucalyptus/Amazon S3 to store the images and was wondering about the significance of the seemingly random file names used by sites like Tumblr, Twitter, etc., e.g.

31.media.tumblr.com/d6ba16060ea4dfd3c67ccf4dbc91df92/tumblr_n164cyLkNl1qkdb42o1_500.jpg

and

pbs.twimg.com/media/Bg7B_kBCMAABYfF.jpg

How do they generate these strings, and what benefits does this have over just incrementing an integer for each file name? Are they just random characters? A hash of an integer?

Thanks!

Answer 1:

Naming files this way organizes the media and guarantees that one file is not overwritten by another with the same name. For example, if Twitter had a million photos in its pbs.twimg.com/media/ directory, two of them could easily both be named cat.jpg. Twitter would then have a problem storing the second file, or deciding which file to return when cat.jpg is requested. To keep the database from mixing up the two files, Twitter (among other applications) renames each file after compressing it, giving it a far more specific name: a string of letters, digits, and symbols that may seem random but is generated by the system, possibly incrementally.

In your CMS, I suggest building in some safeguard against two files clashing, whether that means one overwriting another on upload or a request retrieving the wrong one of two files that share a name. You can do this in a few different ways. One is the method I just described: rename each file and have the system generate its name, for example from an auto-incrementing ID. Do not expose these names in an obvious pattern, though, because then anyone could reach all the media by editing the URL in the address bar. This is another reason why the URLs are not human-readable.
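As a rough illustration of names that follow no obvious pattern, here is a minimal Python sketch. It is not how Twitter or Tumblr generate their names; the function name random_image_name and the 16-byte token length are assumptions chosen for the example.

```python
import secrets


def random_image_name(extension: str = "jpg", nbytes: int = 16) -> str:
    """Return a hard-to-guess file name made of URL-safe characters."""
    # token_urlsafe() draws nbytes random bytes from the OS random source
    # and encodes them with the URL-safe Base64 alphabet (no padding).
    token = secrets.token_urlsafe(nbytes)
    return f"{token}.{extension}"


if __name__ == "__main__":
    print(random_image_name())
```

With 16 random bytes a collision is extremely unlikely, but a real uploader would still want to handle the case where a generated name is already taken.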

You can also call the file_exists() function in your uploader. This is a PHP function that checks whether a file or directory already exists at a given path. See the PHP manual for more about that function.
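One caveat with checking first and writing afterwards is that two simultaneous uploads could both pass the check before either writes. The sketch below shows the same idea in Python rather than PHP (a substitution made for this example), using exclusive-create mode so the check and the creation happen in one step. It assumes a local file system, and save_without_overwrite is a hypothetical helper name.

```python
import os


def save_without_overwrite(directory: str, name: str, data: bytes) -> bool:
    """Write data to directory/name only if that name is still free.

    Returns True if the file was written, False if the name was taken.
    """
    path = os.path.join(directory, name)
    try:
        # Mode "x" creates the file and fails with FileExistsError if it
        # already exists, so an existing file is never overwritten.
        with open(path, "xb") as handle:
            handle.write(data)
    except FileExistsError:
        return False
    return True
```

When the function returns False, the uploader can generate a new name and try again.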

Hope this helps.

Answer 2:

My guess about the tumblr file naming scheme is as follows:

d6ba16060ea4dfd3c67ccf4dbc91df92 - hash of the image file. Its 32 hex characters match the length of an MD5 digest; a full SHA-1 digest would be 40 characters.
tumblr_n164cyLkNl1qkdb42o1_500.jpg - several parts:
  tumblr_ - obvious prefix to advertise the site.
  n164cyLkNl1qkdb42o - consists of 2 parts, 10 characters before the '1' and 7 after:
    n164cyLkNl - some kind of hash of the ID of the post that the image belongs to. Might be a Base64 value with a custom alphabet.
    qkdb42o - hash of the Tumblr blog name.
  1 - the number that follows, in this case '1': the index of the image within a photoset. For a single photo it is just '1'.
  _500 - maximum width of the image in pixels.
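To make the guessed layout concrete, here is a small Python sketch that splits a file name into those parts. The pattern encodes only the guess above (fixed lengths of 10 and 7 characters around a literal '1'), not a documented Tumblr format, and parse_tumblr_name is a hypothetical helper name.

```python
import re
from typing import Optional

# Pattern for the guessed layout: tumblr_<post>1<blog><index>_<width>.<ext>
PATTERN = re.compile(
    r"^tumblr_"
    r"(?P<post>[A-Za-z0-9]{10})1"   # 10 characters, then the literal '1'
    r"(?P<blog>[A-Za-z0-9]{7})"     # 7 characters tied to the blog name
    r"(?P<index>\d+)"               # position of the image in a photoset
    r"_(?P<width>\d+)"              # maximum width in pixels
    r"\.(?P<ext>[A-Za-z0-9]+)$"
)


def parse_tumblr_name(filename: str) -> Optional[dict]:
    """Split a file name into the guessed parts, or return None."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    parts["index"] = int(parts["index"])
    parts["width"] = int(parts["width"])
    return parts


if __name__ == "__main__":
    print(parse_tumblr_name("tumblr_n164cyLkNl1qkdb42o1_500.jpg"))
    # {'post': 'n164cyLkNl', 'blog': 'qkdb42o', 'index': 1, 'width': 500, 'ext': 'jpg'}
```

File names that do not fit the guessed layout simply return None.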

Source: I have collected quite a lot of images and tags from Tumblr, and the pattern turned out to be obvious. The tagging style is the same across images that share a blog-name hash, while the tags of images that share a post-ID hash are 100% identical.

Now, if only there were a way to decode those hashes back to their original values (assuming they're not actually hashes but encoded values, which is unlikely).
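The difference between an encoded value and a hash can be sketched in Python. Below, an integer ID is written in a 62-character alphabet and decoded back, whereas an MD5 digest of the same ID cannot be reversed. The alphabet and the function names are assumptions for illustration; nothing here is known to match what Tumblr actually does.

```python
import hashlib
import string

# Illustrative 62-symbol alphabet: 0-9, a-z, A-Z.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def encode_id(number: int) -> str:
    """Write a non-negative integer using the symbols of ALPHABET."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    symbols = []
    while number:
        number, remainder = divmod(number, BASE)
        symbols.append(ALPHABET[remainder])
    return "".join(reversed(symbols))


def decode_id(text: str) -> int:
    """Invert encode_id(): recover the original integer."""
    number = 0
    for symbol in text:
        number = number * BASE + ALPHABET.index(symbol)
    return number


def hash_id(number: int) -> str:
    """One-way alternative: the MD5 hex digest of the decimal ID."""
    return hashlib.md5(str(number).encode("ascii")).hexdigest()


if __name__ == "__main__":
    short = encode_id(21919766)
    print(short, decode_id(short))   # the encoding round-trips
    print(hash_id(21919766))         # 32 hex characters, not reversible
```

If the strings in the file names were encodings of this kind, knowing the alphabet would be enough to decode them; if they are real hashes, the original values can only be confirmed by guessing and re-hashing.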

Source: https://stackoverflow.com/questions/21919766/image-url-naming-scheme

Tags

twitter

amazon-s3

content-management-system

Eucalyptus