Django uploads: Discard uploaded duplicates, use existing file (md5 based check)

别那么骄傲 2020-12-07 17:11

I have a model with a FileField, which holds user uploaded files. Since I want to save space, I would like to avoid duplicates.

What I'd like t

5 Answers
  • 2020-12-07 17:19

    This answer helped me solve the problem where I wanted to raise an exception if the file being uploaded already existed. This version raises an exception if a file with the same name already exists in the upload location.

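    from django.core.exceptions import ValidationError
    from django.core.files.storage import FileSystemStorage
    
    class FailOnDuplicateFileSystemStorage(FileSystemStorage):
        def get_available_name(self, name, max_length=None):
            # Keep the requested name instead of letting Django append a suffix
            return name
    
        def _save(self, name, content):
            # Refuse to write when a file with this name is already stored
            if self.exists(name):
                raise ValidationError('File already exists: %s' % name)
    
            return super(
                FailOnDuplicateFileSystemStorage, self)._save(name, content)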
    
  • 2020-12-07 17:22

    AFAIK you can't easily implement this with the model's save/delete methods, because files are handled in their own special way.

    But you could try something like this.

    First, my simple md5 file hash function:

    import hashlib
    
    def md5_for_file(chunks):
        # Feed the hash one chunk at a time so a large upload is never held in memory whole
        md5 = hashlib.md5()
        for data in chunks:
            md5.update(data)
        return md5.hexdigest()
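
    As a quick sanity check (a sketch, not part of the original answer): hashing chunk by chunk gives the same digest as hashing the whole content at once, so identical uploads always end up with the same name. The iter_chunks helper below is a hypothetical stand-in for the chunks() method of a Django uploaded file.

```python
import hashlib


def md5_for_file(chunks):
    # Same function as above, repeated so this sketch runs on its own
    md5 = hashlib.md5()
    for data in chunks:
        md5.update(data)
    return md5.hexdigest()


def iter_chunks(data, chunk_size=4):
    # Hypothetical stand-in for UploadedFile.chunks(): yield fixed-size byte slices
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


payload = b"same bytes, same digest"
chunked_digest = md5_for_file(iter_chunks(payload))
whole_digest = hashlib.md5(payload).hexdigest()
```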
    

    Next, simple_upload_to is something like your media_file_name function. You should use it like this:

    import os
    
    from django.db import models
    
    def simple_upload_to(field_name, path='files'):
        def upload_to(instance, filename):
            # Name the file after the md5 of its content and keep the (lowercased) extension
            name = md5_for_file(getattr(instance, field_name).chunks())
            dot_pos = filename.rfind('.')
            ext = filename[dot_pos:][:10].lower() if dot_pos > -1 else '.unknown'
            name += ext
            return os.path.join(path, name[:2], name)
        return upload_to
    
    class Media(models.Model):
        # see info about storage below
        orig_file = models.FileField(upload_to=simple_upload_to('orig_file'), storage=MyCustomStorage())
    

    Of course, it's just an example, so the path generation logic can vary.
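
    One possible variation, as a sketch only: spread the files over nested one-character directories taken from the start of the digest. The sharded_name helper and its root and levels parameters are made up for illustration. posixpath is used because storage names use forward slashes.

```python
import posixpath


def sharded_name(digest, filename, root="files", levels=2):
    # Hypothetical helper: nest the file under `levels` one-character
    # directories taken from the start of the digest
    _, ext = posixpath.splitext(filename)
    shards = [digest[i] for i in range(levels)]
    return posixpath.join(root, *shards, digest + ext.lower())


example = sharded_name("d41d8cd98f00b204e9800998ecf8427e", "Photo.JPG")
```

    Here example comes out as files/d/4/d41d8cd98f00b204e9800998ecf8427e.jpg. A file without an extension simply gets the bare digest as its name.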

    And the most important part:

    from django.core.files.storage import FileSystemStorage
    
    class MyCustomStorage(FileSystemStorage):
        def get_available_name(self, name, max_length=None):
            # Keep the requested name instead of letting Django append a suffix
            return name
    
        def _save(self, name, content):
            # Replace any existing file that has the same name
            if self.exists(name):
                self.delete(name)
            return super(MyCustomStorage, self)._save(name, content)
    

    As you can see, this custom storage deletes the existing file before saving and then saves the new one under the same name. So this is the place to implement your own logic if it is important NOT to delete (and thus update) existing files.

    You can find more about storages here: https://docs.djangoproject.com/en/1.5/ref/files/storage/

  • 2020-12-07 17:24

    Data goes from template -> forms -> views -> db (model). It makes sense to stop the duplicates at the earliest step, which in this case is forms.py.

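    # scripts.py
    import hashlib
    
    def generate_sha(file):
        sha = hashlib.sha1()
        # Rewind first in case the upload has already been read
        file.seek(0)
        while True:
            # Read in blocks of up to 100 MiB
            buf = file.read(104857600)
            if not buf:
                break
            sha.update(buf)
        sha1 = sha.hexdigest()
        # Rewind again so the file can still be saved afterwards
        file.seek(0)
        return sha1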
    
    # models.py
    from django.db import models
    
    class images(models.Model):
        label = models.CharField(max_length=21, blank=False, null=False)
        image = models.ImageField(upload_to='images/')
        image_sha1 = models.CharField(max_length=40, blank=False, null=False)
        # auto_now_add stores the creation time once; auto_now would overwrite it on every save
        create_time = models.DateTimeField(auto_now_add=True)
    
    # forms.py
    from django import forms
    
    from .models import images
    from .scripts import generate_sha
    
    class imageForm(forms.Form):
        Label = forms.CharField(max_length=21, required=True)
        Image = forms.ImageField(required=True)
    
        def clean(self):
            cleaned_data = super(imageForm, self).clean()
            Label = cleaned_data.get('Label')
            Image = cleaned_data.get('Image')
            # Check for missing values first: hashing a missing upload would fail
            if not Label:
                raise forms.ValidationError('No Label')
            if not Image:
                raise forms.ValidationError('No Image')
            sha1 = generate_sha(Image)
            if images.objects.filter(image_sha1=sha1).exists():
                raise forms.ValidationError('This already exists')
            return cleaned_data
    
    # views.py
    from django.shortcuts import render
    
    from .scripts import *
    from .models import *
    from .forms import *
    
    def image(request):
        if request.method == 'POST':
            form = imageForm(request.POST, request.FILES)
            if form.is_valid():
                # the images model above defines no payee field, so none is passed here
                photo = images(
                    image=request.FILES['Image'],
                    image_sha1=generate_sha(request.FILES['Image']),
                    label=form.cleaned_data.get('Label'),
                )
                photo.save()
                return render(request, 'stars/image_form.html', {'form': form})
        else:
            form = imageForm()
        # an invalid POST falls through here and re-renders the bound form with its errors
        context = {'form': form}
        return render(request, 'stars/image_form.html', context)
    
    # image_form.html
    {% extends "base.html" %}
    {% load static %}
    
    {% block content %}
    
     <div class="card mx-auto shadow p-3 mb-5 bg-white rounded text-left" style="max-width: 50rem;">
        <div class="container">
            <form action="{% url 'wallet' %}" method="post" enctype="multipart/form-data">
                {% csrf_token %}
                {{ form  }}
                <input type="submit" value="Upload" class="btn btn-outlined-primary">
            </form>
    
            {% if form.errors %}
                {% for field in form %}
                    {% for error in field.errors %}
                        <p> {{ error }} </p>
                    {% endfor %}
                {% endfor %}
            {% endif %}
    
        </div>
    </div>
    
    {% endblock content %}  
    
    

    reference: http://josephmosby.com/2015/05/13/preventing-file-dupes-in-django.html

  • 2020-12-07 17:33

    Thanks to alTus's answer, I was able to figure out that writing a custom storage class is the key, and it was easier than expected.

    • I skip the call to the superclass's _save method if the file is already there, and just return the name.
    • I override get_available_name so that no numbers get appended to the file name when a file with the same name already exists.

    I don't know if this is the proper way of doing it, but it works fine so far.

    Hope this is useful!

    Here's the complete sample code:

    import hashlib
    import os
    
    from django.core.files.storage import FileSystemStorage
    from django.db import models
    
    class MediaFileSystemStorage(FileSystemStorage):
        def get_available_name(self, name, max_length=None):
            if max_length and len(name) > max_length:
                raise(Exception("name's length is greater than max_length"))
            return name
    
        def _save(self, name, content):
            if self.exists(name):
                # if the file exists, do not call the superclasses _save method
                return name
            # if the file is new, DO call it
            return super(MediaFileSystemStorage, self)._save(name, content)
    
    
    def media_file_name(instance, filename):
        h = instance.md5sum
        basename, ext = os.path.splitext(filename)
        return os.path.join('mediafiles', h[0:1], h[1:2], h + ext.lower())
    
    
    class Media(models.Model):
        # use the custom storage class for the FileField
        orig_file = models.FileField(
            upload_to=media_file_name, storage=MediaFileSystemStorage())
        md5sum = models.CharField(max_length=36)
        # ...
    
        def save(self, *args, **kwargs):
            if not self.pk:  # file is new
                md5 = hashlib.md5()
                for chunk in self.orig_file.chunks():
                    md5.update(chunk)
                self.md5sum = md5.hexdigest()
            super(Media, self).save(*args, **kwargs)
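
    To see the effect without a Django project, here is a rough sketch of the same idea. InMemoryStore is a hypothetical stand-in for the storage class (a dict instead of the file system), and content_name mirrors media_file_name above. Neither is part of Django.

```python
import hashlib
import posixpath


class InMemoryStore:
    # Hypothetical stand-in for a storage backend: maps names to bytes
    def __init__(self):
        self.files = {}
        self.writes = 0

    def save(self, name, content):
        if name in self.files:
            # Duplicate: keep the existing entry and skip the write
            return name
        self.files[name] = content
        self.writes += 1
        return name


def content_name(content, filename):
    # Build the name from the md5 of the content, as media_file_name does
    digest = hashlib.md5(content).hexdigest()
    _, ext = posixpath.splitext(filename)
    return posixpath.join(
        "mediafiles", digest[0:1], digest[1:2], digest + ext.lower())


store = InMemoryStore()
first = store.save(content_name(b"cat picture", "cat.PNG"), b"cat picture")
second = store.save(content_name(b"cat picture", "copy-of-cat.png"), b"cat picture")
third = store.save(content_name(b"dog picture", "dog.png"), b"dog picture")
```

    The two cat uploads get the same name and only one of them is written, while the dog upload is stored separately. Note that the extension is part of the name, so the same bytes uploaded as .png and as .jpg would still be stored twice.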
    
  • 2020-12-07 17:34

    I had the same issue and found this SO question. As this is nothing too uncommon, I searched the web and found the following Python package, which seems to do exactly what you want:

    https://pypi.python.org/pypi/django-hashedfilenamestorage

    If SHA1 hashes are out of the question, I think a pull request to add MD5 hashing support would be a great idea.
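
    If you'd rather roll your own hashing, switching algorithms is cheap with the standard library. This is only a sketch and not the API of the package above. The digest_for_chunks helper and its algorithm parameter are made up for illustration.

```python
import hashlib


def digest_for_chunks(chunks, algorithm="sha1"):
    # hashlib.new() looks the algorithm up by name, so the same loop
    # serves SHA1, MD5 or anything else hashlib supports
    hasher = hashlib.new(algorithm)
    for chunk in chunks:
        hasher.update(chunk)
    return hasher.hexdigest()


chunks = [b"hello ", b"world"]
sha1_name = digest_for_chunks(chunks)
md5_name = digest_for_chunks(chunks, algorithm="md5")
```

    A SHA1 hex digest is 40 characters long and an MD5 one is 32, so size any CharField that stores it accordingly.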
