Question
I'm running a Django server to serve files from another server in a protected network. When a user requests access to multiple files at once, I'd like my Django server to stream all of those files to that user.
Since downloading multiple files at once is not easily possible in a browser, the files need to be bundled somehow. I don't want my server to download all the files first and then serve a ready-made bundle, because that adds a lot of waiting time for larger files. My understanding of zips is that they cannot be streamed while being assembled.
Is there any way to start streaming a container as soon as the first bytes from the remote server are available?
Answer 1:
Tar files are designed to collect multiple files into one archive. They were developed for tape recorders and therefore offer sequential writes and reads.
With Django it is possible to stream files to a browser with FileResponse(), which can take a generator as an argument.
If we feed it a generator that assembles the tar file with the data the user requested, the tar file will be generated just in time. However, Python's built-in tarfile module doesn't offer such a capability out of the box.
We can, however, make use of tarfile's ability to work on a file-like object and handle the assembly of the archive ourselves. We can create a BytesIO() object that the tar file gets incrementally written to, and flush its contents to Django's FileResponse(). For this to work we need to implement a few methods that FileResponse() and tarfile expect access to. Let's create a class FileStream:
from io import BytesIO
from math import ceil  # used by tarsize() below
import tarfile
import requests        # used by yield_tar() below

class FileStream:
    # Referenced as cls.RECORDSIZE by yield_tar() and tarsize() below
    RECORDSIZE = tarfile.RECORDSIZE  # 20 blocks * 512 bytes = 10240 bytes

    def __init__(self):
        self.buffer = BytesIO()
        self.offset = 0

    def write(self, s):
        # tarfile writes the archive through this method
        self.buffer.write(s)
        self.offset += len(s)

    def tell(self):
        # tarfile queries its current position in the archive
        return self.offset

    def close(self):
        self.buffer.close()

    def pop(self):
        # Return everything buffered so far and start a fresh buffer
        s = self.buffer.getvalue()
        self.buffer.close()
        self.buffer = BytesIO()
        return s
Now when we write() data to FileStream's buffer and yield FileStream.pop(), Django will send that data immediately to the user.
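To illustrate this write-then-pop pattern in isolation, here is a minimal sketch (the view name and payload are invented for demonstration and are not part of the original answer):
from django.http import FileResponse

def demo_view(request):
    def byte_chunks():
        stream = FileStream()
        for part in (b"first chunk\n", b"second chunk\n"):
            stream.write(part)
            yield stream.pop()  # flushed to the client immediately
    return FileResponse(byte_chunks(), content_type="application/octet-stream")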
As the data, we now want to assemble the tar file. In the FileStream class we add another method:
    @classmethod
    def yield_tar(cls, file_data_iterable):
        stream = FileStream()
        # mode='w|' opens the archive for sequential (non-seekable) writing
        tar = tarfile.TarFile.open(mode='w|', fileobj=stream,
                                   bufsize=tarfile.BLOCKSIZE)
This creates a FileStream instance and a file handle in memory. The file handle reads and writes its data to the FileStream instance instead of a file on disk.
Now, for each file, we first have to add a tarfile.TarInfo() object to the tar file. It represents the header for the sequentially written data, carrying information like file name, size and time of modification.
        for file_name, file_size, file_date, file_data in file_data_iterable:
            tar_info = tarfile.TarInfo(file_name)
            tar_info.size = int(file_size)
            tar_info.mtime = file_date
            tar.addfile(tar_info)  # writes only the 512-byte header block
            yield stream.pop()
You can also see the structure used to pass data into that method: file_data_iterable is a list of tuples containing ((str) file_name, (int/str) file_size, (int/float) unix_timestamp, (bytes) file_data).
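For illustration, such a list could look like this (all values here are made up; note that file_size must match the actual number of data bytes):
file_data_iterable = [
    ("hello.txt", 11, 1600000000, b"hello world"),
    ("readme.md", "7", 1600000100, b"# hello"),  # size may also be a str
]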
When the TarInfo header has been sent, iterate over the file data. This data needs to be iterable; e.g. you could use a requests.Response object that you retrieve with requests.get(url, stream=True).
            for chunk in requests.get(url, stream=True).iter_content(
                    chunk_size=cls.RECORDSIZE):
                # you can freely choose that chunk size,
                # but this one gives me good performance
                tar.fileobj.write(chunk)
                yield stream.pop()
Note: here I used the variable url to request a file. You will need to pass it in place of file_data within the tuple arguments. If you choose to pass in an iterable file object instead, you will need to update this line accordingly.
Finally, the tar file requires a special ending to indicate that the archive has finished. Tar files consist of blocks and records: a block contains 512 bytes, and a record contains 20 blocks (20 * 512 bytes = 10240 bytes). First, the last block containing the final chunk of file data is filled up with NULs (plain zero bytes); then the TarInfo header of the next file begins.
To end the archive, the current record is filled up with NULs, and there have to be at least two blocks filled entirely with NULs. This is taken care of by tar.close(). Also see this Wiki.
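As a quick worked example of this layout (the 100-byte file size is assumed for illustration):
import tarfile
from math import ceil

# One 100-byte file inside the archive:
header = tarfile.BLOCKSIZE                                # 512-byte header
data = ceil(100 / tarfile.BLOCKSIZE) * tarfile.BLOCKSIZE  # 100 bytes padded to 512
content = header + data                                   # 1024 bytes of content
# Ending: at least two NUL blocks, then fill up the 10240-byte record.
# 10240 - 1024 = 9216 bytes of NULs >= 2 * 512, so one record suffices:
total = tarfile.RECORDSIZE                                # archive is 10240 bytes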
            # tar.addfile() only wrote the header; the data went directly to
            # tar.fileobj, so pad the last partial block with NULs and
            # advance tarfile's internal offset manually
            blocks, remainder = divmod(tar_info.size, tarfile.BLOCKSIZE)
            if remainder > 0:
                tar.fileobj.write(tarfile.NUL * (tarfile.BLOCKSIZE - remainder))
                yield stream.pop()
                blocks += 1
            tar.offset += blocks * tarfile.BLOCKSIZE
        tar.close()
        yield stream.pop()
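Putting these fragments together, the complete generator might look like this (a sketch assembled from the snippets above; it assumes, per the note earlier, that each tuple carries a url instead of raw file_data):
    @classmethod
    def yield_tar(cls, file_data_iterable):
        stream = FileStream()
        # mode='w|' opens the tar for sequential, non-seekable writing
        tar = tarfile.TarFile.open(mode='w|', fileobj=stream,
                                   bufsize=tarfile.BLOCKSIZE)
        for file_name, file_size, file_date, url in file_data_iterable:
            tar_info = tarfile.TarInfo(file_name)
            tar_info.size = int(file_size)
            tar_info.mtime = file_date
            tar.addfile(tar_info)  # header only
            yield stream.pop()
            for chunk in requests.get(url, stream=True).iter_content(
                    chunk_size=cls.RECORDSIZE):
                tar.fileobj.write(chunk)
                yield stream.pop()
            # pad the last partial data block and fix up tarfile's offset
            blocks, remainder = divmod(tar_info.size, tarfile.BLOCKSIZE)
            if remainder > 0:
                tar.fileobj.write(tarfile.NUL * (tarfile.BLOCKSIZE - remainder))
                yield stream.pop()
                blocks += 1
            tar.offset += blocks * tarfile.BLOCKSIZE
        tar.close()  # writes the end-of-archive NUL blocks
        yield stream.pop()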
You can now make use of the FileStream class in your Django view:
from django.http import FileResponse
from yourapp.filestream import FileStream  # adjust to wherever FileStream lives

def stream_files(request, files):
    file_data_iterable = [(
        file.name,
        file.size,
        file.date.timestamp(),
        file.data
    ) for file in files]
    response = FileResponse(
        FileStream.yield_tar(file_data_iterable),
        content_type="application/x-tar"
    )
    response["Content-Disposition"] = 'attachment; filename="streamed.tar"'
    return response
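Assuming the view is wired up to a URL (the endpoint below is hypothetical), you can check that the archive arrives as a valid tar stream, e.g. with a small client script:
import requests

# Hypothetical endpoint; adjust to your URL configuration
with requests.get("http://localhost:8000/download/", stream=True) as r:
    with open("streamed.tar", "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
# streamed.tar should now list and extract like any ordinary tar archive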
If you want to pass the size of the tar file so the user can see a progress bar, you can determine the size of the uncompressed tar file ahead of time. In the FileStream class add another method:
    @classmethod
    def tarsize(cls, sizes):
        # Each file is preceded by a 512-byte header
        header_size = 512
        # Each file's data is padded to fill up whole blocks
        tar_sizes = [ceil((header_size + size) / tarfile.BLOCKSIZE)
                     * tarfile.BLOCKSIZE for size in sizes]
        # The end of the archive is marked by at least two consecutive
        # zero-filled blocks, and the final record block is filled up
        # with zeros.
        sum_size = sum(tar_sizes)
        remainder = cls.RECORDSIZE - (sum_size % cls.RECORDSIZE)
        if remainder < 2 * tarfile.BLOCKSIZE:
            sum_size += cls.RECORDSIZE
        total_size = sum_size + remainder
        assert total_size % cls.RECORDSIZE == 0
        return total_size
and use that to set the response header:
tar_size = FileStream.tarsize([file.size for file in files])
...
response["Content-Length"] = tar_size
Huge thanks to chipx86 and allista whose gists have helped me massively with this task.
Source: https://stackoverflow.com/questions/64169858/stream-multiple-files-at-once-from-a-django-server