When downloading a large file with python, I want to put a time limit not only for the connection process, but also for the download.
I am trying with the following
Run download in a thread which you can then abort if not finished on time.
import requests
import threading
URL='http://ipv4.download.thinkbroadband.com/1GB.zip'
TIMEOUT=0.5
def download(return_value):
return_value.append(requests.get(URL))
return_value = []
download_thread = threading.Thread(target=download, args=(return_value,))
download_thread.start()
download_thread.join(TIMEOUT)
if download_thread.is_alive():
print 'The download was not finished on time...'
else:
print return_value[0].headers['content-length']
When using Requests' prefetch=False
parameter, you get to pull in arbitrary-sized chunks of the respone at a time (rather than all at once).
What you'll need to do is tell Requests not to preload the entire request and keep your own time of how much you've spent reading so far, while fetching small chunks at a time. You can fetch a chunk using r.raw.read(CHUNK_SIZE)
. Overall, the code will look something like this:
import requests
import time
CHUNK_SIZE = 2**12 # Bytes
TIME_EXPIRE = time.time() + 5 # Seconds
r = requests.get('http://ipv4.download.thinkbroadband.com/1GB.zip', prefetch=False)
data = ''
buffer = r.raw.read(CHUNK_SIZE)
while buffer:
data += buffer
buffer = r.raw.read(CHUNK_SIZE)
if TIME_EXPIRE < time.time():
# Quit after 5 seconds.
data += buffer
break
r.raw.release_conn()
print "Read %s bytes out of %s expected." % (len(data), r.headers['content-length'])
Note that this might sometimes use a bit more than the 5 seconds allotted as the final r.raw.read(...)
could lag an arbitrary amount of time. But at least it doesn't depend on multithreading or socket timeouts.
And the answer is: do not use requests, as it is blocking. Use non-blocking network I/O, for example eventlet:
import eventlet
from eventlet.green import urllib2
from eventlet.timeout import Timeout
url5 = 'http://ipv4.download.thinkbroadband.com/5MB.zip'
url10 = 'http://ipv4.download.thinkbroadband.com/10MB.zip'
urls = [url5, url5, url10, url10, url10, url5, url5]
def fetch(url):
response = bytearray()
with Timeout(60, False):
response = urllib2.urlopen(url).read()
return url, len(response)
pool = eventlet.GreenPool()
for url, length in pool.imap(fetch, urls):
if (not length):
print "%s: timeout!" % (url)
else:
print "%s: %s" % (url, length)
Produces expected results:
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880