Using Python Requests to 'bridge' a file without loading into memory?

情深已故 2021-01-02 05:18

I'd like to use the Python Requests library to GET a file from a URL and use it as a multipart-encoded file in a POST request. The catch is that the file could be very larg

4 answers
  • 2021-01-02 05:38

    As other answers have pointed out already: requests doesn't support POSTing multipart-encoded files without loading them into memory.

    To upload a large file without loading it into memory using multipart/form-data, you could use poster:

    #!/usr/bin/env python
    import sys
    from urllib2 import Request, urlopen
    
    from poster.encode import multipart_encode # $ pip install poster
    from poster.streaminghttp import register_openers
    
    register_openers() # install openers globally
    
    def report_progress(param, current, total):
        sys.stderr.write("\r%03d%% of %d" % (int(1e2*current/total + .5), total))
    
    url = 'http://example.com/path/'
    params = {'file': open(sys.argv[1], "rb"), 'name': 'upload test'}
    response = urlopen(Request(url, *multipart_encode(params, cb=report_progress)))
    print response.read()
    

    It can be adapted to allow a GET response object instead of a local file:

    import posixpath
    import sys
    from urllib import unquote
    from urllib2 import Request, urlopen
    from urlparse import urlsplit
    
    from poster.encode import MultipartParam, multipart_encode # pip install poster
    from poster.streaminghttp import register_openers
    
    register_openers() # install openers globally
    
    class MultipartParamNoReset(MultipartParam):
        def reset(self):
            pass # do nothing (to allow self.fileobj without seek() method)
    
    get_url = 'http://example.com/bigfile'
    post_url = 'http://example.com/path/'
    
    get_response = urlopen(get_url)
    param = MultipartParamNoReset(
        name='file',
        filename=posixpath.basename(unquote(urlsplit(get_url).path)), #XXX \ bslash
        filetype=get_response.headers['Content-Type'],
        filesize=int(get_response.headers['Content-Length']),
        fileobj=get_response)
    
    params = [('name', 'upload test'), param]
    # report_progress is the callback defined in the first snippet above
    datagen, headers = multipart_encode(params, cb=report_progress)
    post_response = urlopen(Request(post_url, datagen, headers))
    print post_response.read()
    

    This solution requires a valid Content-Length header (known file size) in the GET response. If the file size is unknown, chunked transfer encoding could be used to upload the multipart/form-data content. A similar solution could be implemented without poster by using urllib3.filepost, which ships with the requests library, e.g., based on @AdrienF's answer.
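
    For the unknown-size case, here is a minimal sketch (Python 3, unlike the Python 2 snippets above) of how chunked transfer encoding frames a body on the wire: each chunk is prefixed with its size in hexadecimal, and a zero-size chunk ends the body. The name chunked_encode is hypothetical and used only for illustration; HTTP client libraries normally apply this framing themselves when they are given a body of unknown length.

```python
def chunked_encode(chunks):
    """Frame an iterable of byte chunks using HTTP/1.1 chunked transfer coding.

    Hypothetical helper for illustration only; real HTTP clients do this
    internally when the body length is unknown.
    """
    for chunk in chunks:
        if not chunk:
            # A zero-length chunk would terminate the body early, so skip it.
            continue
        # <size in hex>CRLF<data>CRLF
        yield b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    # The terminating zero-size chunk (no trailers).
    yield b"0\r\n\r\n"
```

    Because no total length is announced up front, the sender never needs to know the file size. The receiving server does have to accept chunked request bodies, which is worth checking before relying on this.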

  • 2021-01-02 05:46

    There actually is an issue about this on Kenneth Reitz's GitHub repo. I had the same problem (although I'm just uploading a local file), so I added a wrapper class that holds a list of streams corresponding to the different parts of the request. Its read() method iterates through the list and reads each part, and the class also computes the values needed for the headers (boundary and Content-Length):

    # coding=utf-8
    
    from __future__ import unicode_literals
    from mimetools import choose_boundary
    from requests.packages.urllib3.filepost import iter_fields, get_content_type
    from io import BytesIO
    import codecs
    
    writer = codecs.lookup('utf-8')[3]
    
    class MultipartUploadWrapper(object):
    
        def __init__(self, files):
            """
            Initializer
    
            :param files:
                A dictionary of files to upload, of the form {'file': ('filename', <file object>)}
            :type files:
                Dict
            """
            super(MultipartUploadWrapper, self).__init__()
            self._cursor = 0
            self._body_parts = None
            self.content_type_header = None
            self.content_length_header = None
            self.create_request_parts(files)
    
        def create_request_parts(self, files):
            request_list = []
            boundary = choose_boundary()
            content_length = 0
    
            boundary_string = b'--%s\r\n' % (boundary)
            for fieldname, value in iter_fields(files):
                content_length += len(boundary_string)
    
                if isinstance(value, tuple):
                    filename, data = value
                    content_disposition_string = (('Content-Disposition: form-data; name="%s"; ''filename="%s"\r\n' % (fieldname, filename))
                                                + ('Content-Type: %s\r\n\r\n' % (get_content_type(filename))))
    
                else:
                    data = value
                    content_disposition_string =  (('Content-Disposition: form-data; name="%s"\r\n' % (fieldname))
                                                + 'Content-Type: text/plain\r\n\r\n')
                request_list.append(BytesIO(str(boundary_string + content_disposition_string)))
                content_length += len(content_disposition_string)
                if hasattr(data, 'read'):
                    data_stream = data
                else:
                    data_stream = BytesIO(str(data))
    
                data_stream.seek(0,2)
                data_size = data_stream.tell()
                data_stream.seek(0)
    
                request_list.append(data_stream)
                content_length += data_size
    
                end_string = b'\r\n'
                request_list.append(BytesIO(end_string))
                content_length += len(end_string)
    
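            closing_boundary = b'--%s--\r\n' % (boundary)
            request_list.append(BytesIO(closing_boundary))
            # the closing boundary is two bytes longer than boundary_string
            content_length += len(closing_boundary)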
    
            # There's a bug in httplib.py that generates a UnicodeDecodeError on binary uploads if
            # there are *any* unicode strings passed into headers as part of the requests call.
            # For this reason all strings are explicitly converted to non-unicode at this point.
            self.content_type_header = {b'Content-Type': b'multipart/form-data; boundary=%s' % boundary}
            self.content_length_header = {b'Content-Length': str(content_length)}
            self._body_parts = request_list
    
        def read(self, chunk_size=0):
            remaining_to_read = chunk_size
            output_array = []
            while remaining_to_read > 0:
                body_part = self._body_parts[self._cursor]
                current_piece = body_part.read(remaining_to_read)
                length_read = len(current_piece)
                output_array.append(current_piece)
                if length_read < remaining_to_read:
                    # we finished this piece but haven't read enough, moving on to the next one
                    remaining_to_read -= length_read
                    if self._cursor == len(self._body_parts) - 1:
                        break
                    else:
                        self._cursor += 1
                else:
                    break
            return b''.join(output_array)
    

    So instead of passing a 'files' keyword argument, you pass this object as the 'data' argument of your request, along with the headers it computed (content_type_header and content_length_header).
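
    The wrapper above is Python 2 only (mimetools no longer exists in Python 3). As a rough sketch of the same idea in Python 3, the class below chains several file-like parts behind a single read() and sums their sizes for the Content-Length value. ChainedReader and total_length are hypothetical names, and the parts are assumed to be seekable binary streams, as in the original wrapper.

```python
import io


def total_length(parts):
    """Sum the remaining bytes of seekable binary streams without reading them."""
    total = 0
    for part in parts:
        position = part.tell()
        part.seek(0, io.SEEK_END)      # jump to the end to learn the size
        total += part.tell() - position
        part.seek(position)            # restore the original position
    return total


class ChainedReader:
    """Present a list of file-like parts as one stream (hypothetical sketch)."""

    def __init__(self, parts):
        self._parts = list(parts)
        self._cursor = 0  # index of the part currently being read

    def read(self, size=-1):
        pieces = []
        remaining = size
        while self._cursor < len(self._parts) and (size < 0 or remaining > 0):
            piece = self._parts[self._cursor].read(remaining if size >= 0 else -1)
            if piece:
                pieces.append(piece)
                if size >= 0:
                    remaining -= len(piece)
            else:
                # The current part is exhausted; move on to the next one.
                self._cursor += 1
        return b"".join(pieces)
```

    Only one chunk of at most size bytes is held at a time, so a large file opened in binary mode can sit between two small BytesIO parts that carry the multipart headers and the closing boundary.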

    Edit

    I've cleaned up the code

  • 2021-01-02 05:48

    You cannot turn anything you please into a context manager in Python. An object needs specific methods (__enter__ and __exit__) to be one. With your current code you can do the following:

    response = requests.get(big_file_url, stream=True)
    
    post_response = requests.post(upload_url, files={'file': ('filename', response.iter_content())})
    

    Using iter_content ensures that the file is never held in memory all at once, because the data is consumed through the iterator. Accessing the content attribute instead would load the whole file into memory.

    Edit: The only reasonable way to do this is to use chunk-encoded uploads, e.g.,

    post_response = requests.post(upload_url, data=response.iter_content())
    

    If you absolutely need to do multipart/form-data encoding then you will have to create an abstraction layer that will take the generator in the constructor, and the Content-Length header from response (to provide an answer for len(file)) that will have a read attribute that will read from the generator. The issue again is that I'm pretty sure the entire thing will be read into memory before it will be uploaded.
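
    A minimal sketch of such an abstraction layer in Python 3 might look like the following. IterStream is a hypothetical name; it takes any iterator of byte chunks plus the length taken from the Content-Length header, answers len(), and serves read() calls from the iterator. Whether a given HTTP client then streams from it instead of buffering is a separate question that this sketch does not settle.

```python
class IterStream:
    """File-like view over an iterator of byte chunks (hypothetical sketch).

    `length` is trusted as given, e.g. taken from a Content-Length header.
    """

    def __init__(self, chunks, length):
        self._chunks = iter(chunks)
        self._length = length
        self._buffer = b""  # bytes pulled from the iterator but not yet returned

    def __len__(self):
        return self._length

    def read(self, size=-1):
        if size is None or size < 0:
            # Read everything that is left (this does load it into memory).
            data = self._buffer + b"".join(self._chunks)
            self._buffer = b""
            return data
        # Pull chunks until the buffer can satisfy the request or input ends.
        while len(self._buffer) < size:
            try:
                self._buffer += next(self._chunks)
            except StopIteration:
                break
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data
```

    With bounded read sizes, the buffer never holds more than the requested size plus one chunk from the underlying iterator.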

    Edit #2

    You might be able to write your own generator that produces the multipart/form-data encoded data. You could pass that in the same way as you would for chunk-encoded requests, but you'd have to make sure you set your own Content-Type and Content-Length headers. I don't have time to sketch an example but it shouldn't be too difficult.
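
    One possible shape for such a generator, as a Python 3 sketch: multipart_stream yields the encoded body piece by piece, and multipart_length computes the matching Content-Length from the file size without touching the file data. Both function names and their parameters are hypothetical, and the sketch assumes field names and the filename contain no double quotes or line breaks.

```python
def _part_header(boundary, name, filename=None, content_type=None):
    """Build the header block that opens one multipart/form-data part."""
    disposition = 'form-data; name="%s"' % name
    if filename is not None:
        disposition += '; filename="%s"' % filename
    lines = ["--" + boundary, "Content-Disposition: " + disposition]
    if content_type is not None:
        lines.append("Content-Type: " + content_type)
    # A blank line separates the part headers from the part body.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("utf-8")


def multipart_stream(fields, file_field, filename, content_type, chunks, boundary):
    """Yield a multipart/form-data body: text fields first, then one file part.

    `fields` is a list of (name, text) pairs; `chunks` is any iterable of
    bytes, for example a streamed download.
    """
    for name, value in fields:
        yield _part_header(boundary, name)
        yield value.encode("utf-8")
        yield b"\r\n"
    yield _part_header(boundary, file_field, filename, content_type)
    for chunk in chunks:
        yield chunk
    yield b"\r\n"
    yield ("--" + boundary + "--\r\n").encode("ascii")  # closing boundary


def multipart_length(fields, file_field, filename, content_type, file_size, boundary):
    """Total body length: the framing around an empty file plus the file size."""
    framing = sum(
        len(piece)
        for piece in multipart_stream(
            fields, file_field, filename, content_type, [], boundary
        )
    )
    return framing + file_size
```

    The Content-Type header would then be multipart/form-data; boundary=<boundary>, using the same boundary string, and multipart_length gives the Content-Length value when the file size is known (for instance from the GET response's Content-Length header).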

  • 2021-01-02 05:59

    In theory you can just use the raw object:

    In [1]: import requests
    
    In [2]: raw = requests.get("http://download.thinkbroadband.com/1GB.zip", stream=True).raw
    
    In [3]: raw.read(10)
    Out[3]: '\xff\xda\x18\x9f@\x8d\x04\xa11_'
    
    In [4]: raw.read(10)
    Out[4]: 'l\x15b\x8blVO\xe7\x84\xd8'
    
    In [5]: raw.read() # take forever...
    
    In [6]: raw = requests.get("http://download.thinkbroadband.com/5MB.zip", stream=True).raw
    
    In [7]: requests.post("http://www.amazon.com", {'file': ('thing.zip', raw, 'application/zip')}, stream=True)
    Out[7]: <Response [200]>
    