Using Python Requests to 'bridge' a file without loading into memory?

情深已故 asked 2021-01-02 05:18 · 728 views

I'd like to use the Python Requests library to GET a file from a URL and use it as a multipart-encoded file in a POST request. The catch is that the file could be very large, and I'd prefer not to load it into memory.

4 Answers
  • 2021-01-02 05:38

    As other answers have pointed out already: requests doesn't support POSTing multipart-encoded files without loading them into memory.

    To upload a large file without loading it into memory using multipart/form-data, you could use poster (note that poster is a Python 2 library, hence the urllib2-based code below):

    #!/usr/bin/env python
    import sys
    from urllib2 import Request, urlopen
    
    from poster.encode import multipart_encode # $ pip install poster
    from poster.streaminghttp import register_openers
    
    register_openers() # install openers globally
    
    def report_progress(param, current, total):
        sys.stderr.write("\r%03d%% of %d" % (int(1e2*current/total + .5), total))
    
    url = 'http://example.com/path/'
    params = {'file': open(sys.argv[1], "rb"), 'name': 'upload test'}
    response = urlopen(Request(url, *multipart_encode(params, cb=report_progress)))
    print response.read()
    

    It can be adapted to accept a GET response object instead of a local file:

    import posixpath
    import sys
    from urllib import unquote
    from urllib2 import Request, urlopen
    from urlparse import urlsplit
    
    from poster.encode import MultipartParam, multipart_encode # pip install poster
    from poster.streaminghttp import register_openers
    
    register_openers() # install openers globally
    
    # progress callback, same as in the first snippet
    def report_progress(param, current, total):
        sys.stderr.write("\r%03d%% of %d" % (int(1e2*current/total + .5), total))
    
    class MultipartParamNoReset(MultipartParam):
        def reset(self):
            pass # do nothing (to allow self.fileobj without seek() method)
    
    get_url = 'http://example.com/bigfile'
    post_url = 'http://example.com/path/'
    
    get_response = urlopen(get_url)
    param = MultipartParamNoReset(
        name='file',
        filename=posixpath.basename(unquote(urlsplit(get_url).path)), # XXX naive; ignores backslash-separated paths
        filetype=get_response.headers['Content-Type'],
        filesize=int(get_response.headers['Content-Length']),
        fileobj=get_response)
    
    params = [('name', 'upload test'), param]
    datagen, headers = multipart_encode(params, cb=report_progress)
    post_response = urlopen(Request(post_url, datagen, headers))
    print post_response.read()
    

    This solution requires a valid Content-Length header (i.e., a known file size) in the GET response. If the file size is unknown, then chunked transfer encoding could be used to upload the multipart/form-data content. A similar solution could be implemented using urllib3.filepost, which ships with the requests library, e.g., based on @AdrienF's answer below, without using poster. A minimal sketch of the chunked approach follows.
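
    As an illustration, here is a rough sketch of that chunked approach using requests itself rather than poster: the multipart body is produced by a generator, so requests sends it with Transfer-Encoding: chunked and no Content-Length is needed. The URLs, field name, and filename are placeholders.

    import uuid
    import requests

    get_url = 'http://example.com/bigfile' # hypothetical source
    post_url = 'http://example.com/path/'  # hypothetical destination

    def multipart_body(resp, boundary, field='file', filename='bigfile'):
        part_type = resp.headers.get('Content-Type', 'application/octet-stream')
        yield (('--%s\r\n'
                'Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
                'Content-Type: %s\r\n\r\n')
               % (boundary, field, filename, part_type)).encode('utf-8')
        for chunk in resp.iter_content(chunk_size=64 * 1024): # stream; never buffer the file
            yield chunk
        yield ('\r\n--%s--\r\n' % boundary).encode('utf-8')

    get_response = requests.get(get_url, stream=True)
    boundary = uuid.uuid4().hex
    post_response = requests.post(
        post_url,
        data=multipart_body(get_response, boundary), # generator => chunked upload
        headers={'Content-Type': 'multipart/form-data; boundary=%s' % boundary})
    print(post_response.status_code)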

  • 2021-01-02 05:46

    There actually is an issue about that on Kenneth Reitz's GitHub repo. I had the same problem (although I'm just uploading a local file), so I added a wrapper class that holds a list of streams corresponding to the different parts of the request. It exposes a read() method that iterates through the list and reads each part in turn, and it also computes the values needed for the headers (boundary and Content-Length):

    # coding=utf-8
    
    from __future__ import unicode_literals
    from mimetools import choose_boundary
    from requests.packages.urllib3.filepost import iter_fields, get_content_type
    from io import BytesIO
    import codecs
    
    writer = codecs.lookup('utf-8')[3]
    
    class MultipartUploadWrapper(object):
    
        def __init__(self, files):
            """
            Initializer
    
            :param files:
                A dictionary of files to upload, of the form {'file': ('filename', <file object>)}
            :type files:
                dict
            """
            super(MultipartUploadWrapper, self).__init__()
            self._cursor = 0
            self._body_parts = None
            self.content_type_header = None
            self.content_length_header = None
            self.create_request_parts(files)
    
        def create_request_parts(self, files):
            request_list = []
            boundary = choose_boundary()
            content_length = 0
    
            boundary_string = b'--%s\r\n' % (boundary)
            for fieldname, value in iter_fields(files):
                content_length += len(boundary_string)
    
                if isinstance(value, tuple):
                    filename, data = value
                    content_disposition_string = (('Content-Disposition: form-data; name="%s"; ''filename="%s"\r\n' % (fieldname, filename))
                                                + ('Content-Type: %s\r\n\r\n' % (get_content_type(filename))))
    
                else:
                    data = value
                    content_disposition_string =  (('Content-Disposition: form-data; name="%s"\r\n' % (fieldname))
                                                + 'Content-Type: text/plain\r\n\r\n')
                request_list.append(BytesIO(str(boundary_string + content_disposition_string)))
                content_length += len(content_disposition_string)
                if hasattr(data, 'read'):
                    data_stream = data
                else:
                    data_stream = BytesIO(str(data))
    
                data_stream.seek(0,2)
                data_size = data_stream.tell()
                data_stream.seek(0)
    
                request_list.append(data_stream)
                content_length += data_size
    
                end_string = b'\r\n'
                request_list.append(BytesIO(end_string))
                content_length += len(end_string)
    
            closing_boundary_string = b'--%s--\r\n' % (boundary)
            request_list.append(BytesIO(closing_boundary_string))
            content_length += len(closing_boundary_string) # closing boundary is 2 bytes longer than the others
    
            # There's a bug in httplib.py that generates a UnicodeDecodeError on binary uploads if
            # there are *any* unicode strings passed into headers as part of the requests call.
            # For this reason all strings are explicitly converted to non-unicode at this point.
            self.content_type_header = {b'Content-Type': b'multipart/form-data; boundary=%s' % boundary}
            self.content_length_header = {b'Content-Length': str(content_length)}
            self._body_parts = request_list
    
        def read(self, chunk_size=0):
            remaining_to_read = chunk_size
            output_array = []
            while remaining_to_read > 0:
                body_part = self._body_parts[self._cursor]
                current_piece = body_part.read(remaining_to_read)
                length_read = len(current_piece)
                output_array.append(current_piece)
                if length_read < remaining_to_read:
                    # we finished this piece but haven't read enough, moving on to the next one
                    remaining_to_read -= length_read
                    if self._cursor == len(self._body_parts) - 1:
                        break
                    else:
                        self._cursor += 1
                else:
                    break
            return b''.join(output_array)
    

    So instead of passing a 'files' keyword argument, you pass this object as the 'data' argument to your request, for example:
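
    A hypothetical usage sketch (the URL and filename are placeholders); the headers computed by the wrapper must be passed explicitly:

    import requests

    wrapper = MultipartUploadWrapper({'file': ('report.csv', open('report.csv', 'rb'))})
    headers = {}
    headers.update(wrapper.content_type_header)   # multipart/form-data; boundary=...
    headers.update(wrapper.content_length_header) # precomputed body size
    response = requests.post('http://example.com/upload', data=wrapper, headers=headers)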

    Edit

    I've cleaned up the code above.

  • 2021-01-02 05:48

    You cannot turn just anything into a context manager in Python; it requires very specific attributes. With your current code you can do the following:

    response = requests.get(big_file_url, stream=True)
    
    post_response = requests.post(upload_url, files={'file': ('filename', response.iter_content())})
    

    Using iter_content ensures that the file is never held in memory all at once: the iterator is consumed lazily, whereas using the content attribute would load the entire file into memory.

    Edit: The only way to reasonably do this is to use chunk-encoded uploads, e.g.:

    post_response = requests.post(upload_url, data=response.iter_content())
    

    If you absolutely need multipart/form-data encoding, then you will have to create an abstraction layer that takes the generator in its constructor, plus the Content-Length header from the response (to provide an answer for len(file)), and exposes a read attribute that reads from the generator. (This is essentially what the wrapper class in the answer above implements.) The issue again is that I'm pretty sure the entire thing will be read into memory before it is uploaded.

    Edit #2

    You might be able to make a generator of your own that produces the multipart/form-data encoded data yourself. You could pass that in the same way as you would a chunk-encoded request, but you'd have to make sure you set your own Content-Type and Content-Length headers. I don't have time to sketch a full example, but it shouldn't be too difficult; see the sketch below.
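
    For what it's worth, the requests-toolbelt package (maintained alongside requests) bundles exactly this pattern: its MultipartEncoder streams the multipart body and supplies the Content-Type (including the boundary) for you, while requests derives Content-Length from it. A minimal sketch, assuming a local file; the URL and filename are placeholders:

    import requests
    from requests_toolbelt.multipart.encoder import MultipartEncoder

    encoder = MultipartEncoder(fields={
        'name': 'upload test',
        'file': ('report.csv', open('report.csv', 'rb'), 'text/csv'),
    })
    response = requests.post(
        'http://example.com/upload', # hypothetical endpoint
        data=encoder,                # read lazily, never fully in memory
        headers={'Content-Type': encoder.content_type})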

  • 2021-01-02 05:59

    In theory you can just use the raw object:

    In [1]: import requests
    
    In [2]: raw = requests.get("http://download.thinkbroadband.com/1GB.zip", stream=True).raw
    
    In [3]: raw.read(10)
    Out[3]: '\xff\xda\x18\x9f@\x8d\x04\xa11_'
    
    In [4]: raw.read(10)
    Out[4]: 'l\x15b\x8blVO\xe7\x84\xd8'
    
    In [5]: raw.read() # takes forever...
    
    In [6]: raw = requests.get("http://download.thinkbroadband.com/5MB.zip", stream=True).raw
    
    In [7]: requests.post("http://www.amazon.com", files={'file': ('thing.zip', raw, 'application/zip')}, stream=True)
    Out[7]: <Response [200]>
    