Efficiently saving a file by hash in Django

Question

I am working on a Django project. I want the user to be able to upload a file (through a Form) and have it saved locally under a custom path with a custom filename: its hash. The only solution I can think of is to use the "upload_to" argument of the FileField. I think this translates to:

1) Write the file to disk

2) Calculate hash

3) Return path + hash as filename

The problem is that there are two write operations: one when saving the file from memory to disk to calculate the hash, and another when actually saving the file to the specified location.

Is there a way to override FileField's save-to-disk method (or where can I find out exactly what is going on behind the scenes), so that I can save the file under a temporary name and then rename it to its hash, instead of having it saved twice?

Thanks.

Answer 1:

The upload_to parameter of FileField accepts a callable, and the string returned from that is joined to your MEDIA_ROOT setting to get the final filename (from the documentation):

This may also be a callable, such as a function, which will be called to obtain the upload path, including the filename. This callable must be able to accept two arguments, and return a Unix-style path (with forward slashes) to be passed along to the storage system. The two arguments that will be passed are:

instance: An instance of the model where the FileField is defined. More specifically, this is the particular instance where the current file is being attached. In most cases, this object will not have been saved to the database yet, so if it uses the default AutoField, it might not yet have a value for its primary key field.

filename: The filename that was originally given to the file. This may or may not be taken into account when determining the final destination path.

Additionally, when you access model.my_file_field, it resolves to an instance of FieldFile, which acts like a file. So, you should be able to write an upload_to like the following:

import os

def hash_upload(instance, filename):
    instance.my_file.open()  # make sure we're at the beginning of the file
    contents = instance.my_file.read()  # read the whole upload into memory
    fname, ext = os.path.splitext(filename)
    # hash_function is a placeholder: any callable that maps bytes to a hex string
    return "{0}_{1}{2}".format(fname, hash_function(contents), ext)  # assemble the filename

Substitute the appropriate hash function you'd like to use. Saving to the disk isn't necessary at all (in fact, the file is often already uploaded to temporary storage, or in the case of smaller files just kept in memory).

You'd use this like:

class MyModel(models.Model):
    my_file = models.FileField(upload_to=hash_upload,...)

I haven't tested this yet, so you might have to poke at the line that reads the whole file (and you may want to just hash the first chunk of the file to prevent malicious users from uploading massive files and causing DoS attacks). You can get the first chunk with
instance.my_file.read(instance.my_file.DEFAULT_CHUNK_SIZE).
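
The concern about reading the whole file at once can be sidestepped by hashing in fixed-size chunks. Below is a minimal sketch in plain Python, assuming only that the upload behaves like a seekable binary stream. The name hash_stream and the 64 KiB default are illustrative choices, not part of Django.

```python
import hashlib
import io

def hash_stream(fileobj, chunk_size=64 * 1024):
    """Return the SHA-256 hex digest of a binary file-like object, read in chunks."""
    ctx = hashlib.sha256()
    fileobj.seek(0)  # start from the beginning, whatever was read before
    # iter() with a sentinel keeps calling read() until it returns b"" (end of file)
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        ctx.update(chunk)
    fileobj.seek(0)  # rewind so the caller can still save the contents
    return ctx.hexdigest()

# Plain-Python check with an in-memory stream standing in for an upload
stream = io.BytesIO(b"hello world" * 1000)
digest = hash_stream(stream, chunk_size=1024)
```

Memory use stays bounded by chunk_size regardless of the upload size, and the digest still covers the whole file, so two files that merely share their first chunk do not end up with the same name.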

Answer 2:

Updated answer for Django 1.10 and later:

- Your instance.my_file_field is an instance of UploadedFile and not a file-like object.
- It can't be opened or closed, only read, possibly in chunks.
- Calling read() unconditionally may consume all available physical memory.

In the following example the instance has a class method "get_image_basedir", because there are several models that all use the same function but require a different base directory. I left that in, since it's a common pattern. HASH_CHUNK_SIZE is a variable I set myself, chosen to optimize disk reads (i.e. matching the block size of the file system or a multiple thereof).

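import hashlib
import os.path

HASH_CHUNK_SIZE = 65536  # assumed value; pick a multiple of your file system's block size

def get_image_path(instance, filename):
    base = instance.get_image_basedir()
    ext = os.path.splitext(filename)[1]
    ctx = hashlib.sha256()
    if instance.img.multiple_chunks():
        # large upload: hash chunk by chunk to keep memory use bounded
        for data in instance.img.chunks(HASH_CHUNK_SIZE):
            ctx.update(data)
    else:
        # small upload: reading it in one go is fine
        ctx.update(instance.img.read())
    return os.path.join(base, ctx.hexdigest() + ext)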
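
For the approach the question itself describes (write once under a temporary name, then rename to the hash), a framework-independent sketch might look like the following. The function save_by_hash and its parameters are hypothetical names used for illustration. It only assumes an iterable of byte chunks, such as what an uploaded file's chunks() yields.

```python
import hashlib
import os
import tempfile

def save_by_hash(chunks, dest_dir, ext=""):
    """Write chunks to a temp file while hashing them, then rename to <hash><ext>."""
    os.makedirs(dest_dir, exist_ok=True)
    ctx = hashlib.sha256()
    # Create the temp file inside dest_dir so the rename stays on one file system
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                ctx.update(chunk)  # hash and write in the same pass
                tmp.write(chunk)
        final_path = os.path.join(dest_dir, ctx.hexdigest() + ext)
        os.replace(tmp_path, final_path)  # replaces any existing file of that name
    except BaseException:
        # Clean up the temp file if writing or renaming failed
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path

# Demo with two in-memory chunks standing in for an upload
demo_dir = tempfile.mkdtemp()
saved = save_by_hash([b"abc", b"def"], demo_dir, ext=".bin")
```

The data is written to disk only once; the rename is a metadata operation as long as the temporary file and the destination share a file system. Wiring this into a FileField would probably mean calling it from a view or a custom storage class rather than from upload_to, and that part is not shown here.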

Source: https://stackoverflow.com/questions/31731470/efficiently-saving-a-file-by-hash-in-django

Tags: python, django, hash