I'd like to use the Python Requests library to GET a file from a URL and use it as a multipart-encoded file in a POST request. The catch is that the file could be very large.
As other answers have pointed out already: requests doesn't support POSTing multipart-encoded files without loading them into memory.
To upload a large file without loading it into memory using multipart/form-data, you could use poster:
#!/usr/bin/env python
import sys

from urllib2 import Request, urlopen

from poster.encode import multipart_encode  # $ pip install poster
from poster.streaminghttp import register_openers

register_openers()  # install openers globally

def report_progress(param, current, total):
    sys.stderr.write("\r%03d%% of %d" % (int(1e2*current/total + .5), total))

url = 'http://example.com/path/'
params = {'file': open(sys.argv[1], "rb"), 'name': 'upload test'}
response = urlopen(Request(url, *multipart_encode(params, cb=report_progress)))
print response.read()
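Assuming the script is saved as upload.py (the name is arbitrary), it takes the file to upload as its only command-line argument:

$ python upload.py /path/to/bigfile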
It can be adapted to upload from a GET response object instead of a local file:
import posixpath
import sys
from urllib import unquote
from urllib2 import Request, urlopen
from urlparse import urlsplit

from poster.encode import MultipartParam, multipart_encode  # pip install poster
from poster.streaminghttp import register_openers

register_openers()  # install openers globally

class MultipartParamNoReset(MultipartParam):
    def reset(self):
        pass  # do nothing (to allow self.fileobj without seek() method)

get_url = 'http://example.com/bigfile'
post_url = 'http://example.com/path/'

get_response = urlopen(get_url)
param = MultipartParamNoReset(
    name='file',
    filename=posixpath.basename(unquote(urlsplit(get_url).path)),
    filetype=get_response.headers['Content-Type'],
    filesize=int(get_response.headers['Content-Length']),
    fileobj=get_response)

params = [('name', 'upload test'), param]
datagen, headers = multipart_encode(params, cb=report_progress)  # report_progress() from the previous example
post_response = urlopen(Request(post_url, datagen, headers))
print post_response.read()
This solution requires a valid Content-Length header (a known file size) in the GET response. If the file size is unknown, then chunked transfer encoding could be used to upload the multipart/form-data content. A similar solution could be implemented using urllib3.filepost, which is shipped with the requests library, e.g. based on @AdrienF's answer, without using poster.
There actually is an issue about that on Kenneth Reitz's GitHub repo. I had the same problem (although I'm just uploading a local file), and I added a wrapper class that holds a list of streams corresponding to the different parts of the request, with a read() method that iterates through the list and reads each part, and that also computes the necessary values for the headers (boundary and content-length):
# coding=utf-8
from __future__ import unicode_literals
from mimetools import choose_boundary
from requests.packages.urllib3.filepost import iter_fields, get_content_type
from io import BytesIO
import codecs

writer = codecs.lookup('utf-8')[3]

class MultipartUploadWrapper(object):

    def __init__(self, files):
        """
        Initializer

        :param files:
            A dictionary of files to upload, of the form {'file': ('filename', <file object>)}
        :type files:
            dict
        """
        super(MultipartUploadWrapper, self).__init__()
        self._cursor = 0
        self._body_parts = None
        self.content_type_header = None
        self.content_length_header = None
        self.create_request_parts(files)

    def create_request_parts(self, files):
        request_list = []
        boundary = choose_boundary()
        content_length = 0

        boundary_string = b'--%s\r\n' % (boundary)
        for fieldname, value in iter_fields(files):
            content_length += len(boundary_string)

            if isinstance(value, tuple):
                filename, data = value
                content_disposition_string = (('Content-Disposition: form-data; name="%s"; '
                                               'filename="%s"\r\n' % (fieldname, filename))
                                              + ('Content-Type: %s\r\n\r\n' % (get_content_type(filename))))
            else:
                data = value
                content_disposition_string = (('Content-Disposition: form-data; name="%s"\r\n' % (fieldname))
                                              + 'Content-Type: text/plain\r\n\r\n')

            request_list.append(BytesIO(str(boundary_string + content_disposition_string)))
            content_length += len(content_disposition_string)
            if hasattr(data, 'read'):
                data_stream = data
            else:
                data_stream = BytesIO(str(data))

            data_stream.seek(0, 2)
            data_size = data_stream.tell()
            data_stream.seek(0)

            request_list.append(data_stream)
            content_length += data_size

            end_string = b'\r\n'
            request_list.append(BytesIO(end_string))
            content_length += len(end_string)

        closing_boundary = b'--%s--\r\n' % (boundary)
        request_list.append(BytesIO(closing_boundary))
        content_length += len(closing_boundary)  # note: two bytes longer than the other boundary lines

        # There's a bug in httplib.py that generates a UnicodeDecodeError on binary uploads if
        # there are *any* unicode strings passed into headers as part of the requests call.
        # For this reason all strings are explicitly converted to non-unicode at this point.
        self.content_type_header = {b'Content-Type': b'multipart/form-data; boundary=%s' % boundary}
        self.content_length_header = {b'Content-Length': str(content_length)}
        self._body_parts = request_list

    def read(self, chunk_size=0):
        remaining_to_read = chunk_size
        output_array = []
        while remaining_to_read > 0:
            body_part = self._body_parts[self._cursor]
            current_piece = body_part.read(remaining_to_read)
            length_read = len(current_piece)
            output_array.append(current_piece)
            if length_read < remaining_to_read:
                # we finished this piece but haven't read enough, moving on to the next one
                remaining_to_read -= length_read
                if self._cursor == len(self._body_parts) - 1:
                    break
                else:
                    self._cursor += 1
            else:
                break
        return b''.join(output_array)
So instead of passing a 'files' keyword argument, you pass this object as the 'data' argument to your request. I've cleaned up the code a bit.
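A minimal usage sketch (the upload URL and file name are made up here for illustration, and exact behaviour depends on the requests version): the wrapper goes in as data and its precomputed headers are merged into the request.

import requests

wrapper = MultipartUploadWrapper({'file': ('report.zip', open('report.zip', 'rb'))})  # hypothetical file
headers = {}
headers.update(wrapper.content_type_header)
headers.update(wrapper.content_length_header)
response = requests.post('http://example.com/upload', data=wrapper, headers=headers)  # hypothetical URL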
You cannot turn just anything into a context manager in Python; it requires very specific attributes. With your current code you can do the following:
response = requests.get(big_file_url, stream=True)
post_response = requests.post(upload_url, files={'file': ('filename', response.iter_content())})
Using iter_content will ensure that your file is never entirely in memory: the iterator is consumed lazily, whereas using the content attribute would load the whole file into memory.
Edit: The only way to reasonably do this is to use chunk-encoded uploads, e.g.,
post_response = requests.post(upload_url, data=response.iter_content())
If you absolutely need multipart/form-data encoding, then you will have to create an abstraction layer that takes the generator in the constructor along with the Content-Length header from response (to provide an answer for len(file)), and that has a read method which reads from the generator. The issue again is that I'm pretty sure the entire thing will be read into memory before it is uploaded.
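A rough sketch of the adapter that paragraph describes (a hypothetical class, not part of requests; whether it truly streams depends on how the HTTP stack consumes read()). It buffers only what each read() call asks for, and exposes the known total size through a len attribute, which requests' length detection looks for:

import requests

class GeneratorFile(object):
    """File-like wrapper exposing a generator through a read() method."""

    def __init__(self, generator, length):
        self._generator = generator
        self._buffer = b''
        self.len = length  # requests' length detection checks for a `len` attribute

    def read(self, size=-1):
        # Pull chunks from the generator until the request can be satisfied.
        while size < 0 or len(self._buffer) < size:
            try:
                self._buffer += next(self._generator)
            except StopIteration:
                break
        if size < 0:
            data, self._buffer = self._buffer, b''
        else:
            data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

# usage with a streaming GET response (the URL is hypothetical):
resp = requests.get('http://example.com/bigfile', stream=True)
body = GeneratorFile(resp.iter_content(8192), int(resp.headers['Content-Length']))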
Edit #2: You might be able to make a generator of your own that produces the multipart/form-data encoded data yourself. You could pass that in the same way as you would a chunk-encoded request, but you'd have to make sure you set your own Content-Type and Content-Length headers. I don't have time to sketch an example, but it shouldn't be too difficult.
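Something along these lines might work (a sketch only; the URLs, field name, and chunk size are invented). Note that when data is a generator, requests falls back to chunked transfer encoding, so only the Content-Type header with the boundary has to be set by hand:

import uuid
import requests

get_url = 'http://example.com/bigfile'   # hypothetical source
post_url = 'http://example.com/upload'   # hypothetical destination

resp = requests.get(get_url, stream=True)
boundary = uuid.uuid4().hex
head = ('--%s\r\n'
        'Content-Disposition: form-data; name="file"; filename="bigfile"\r\n'
        'Content-Type: application/octet-stream\r\n'
        '\r\n' % boundary).encode('ascii')
tail = ('\r\n--%s--\r\n' % boundary).encode('ascii')

def multipart_body():
    # Emit the multipart framing around the streamed file content.
    yield head
    for chunk in resp.iter_content(chunk_size=64 * 1024):
        yield chunk
    yield tail

headers = {'Content-Type': 'multipart/form-data; boundary=%s' % boundary}
post_response = requests.post(post_url, data=multipart_body(), headers=headers)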
In theory you can just use the raw object:
In [1]: import requests
In [2]: raw = requests.get("http://download.thinkbroadband.com/1GB.zip", stream=True).raw
In [3]: raw.read(10)
Out[3]: '\xff\xda\x18\x9f@\x8d\x04\xa11_'
In [4]: raw.read(10)
Out[4]: 'l\x15b\x8blVO\xe7\x84\xd8'
In [5]: raw.read() # takes forever...
In [6]: raw = requests.get("http://download.thinkbroadband.com/5MB.zip", stream=True).raw
In [7]: requests.post("http://www.amazon.com", files={'file': ('thing.zip', raw, 'application/zip')}, stream=True)
Out[7]: <Response [200]>