Requests with multiple connections


I use the Python Requests library to download a big file, e.g.:

r = requests.get("http://bigfile.com/bigfile.bin")
content = r.content

The file downloads over a single connection, which is slow. Is it possible to speed this up by using multiple connections for the same file?

3 Answers
  • 2020-12-05 16:19

    This solution requires the Linux utility aria2c, but it has the advantage that downloads can easily be resumed.

    It also assumes that all the files you want to download are listed in the HTTP directory listing at the location MY_HTTP_LOC. I tested this script against an instance of the lighttpd/1.4.26 HTTP server, but you can easily modify it to work with other setups.

    #!/usr/bin/python
    
    import os
    import urllib
    import re
    import subprocess
    
    MY_HTTP_LOC = "http://AAA.BBB.CCC.DDD/"
    
    # retrieve webpage source code
    f = urllib.urlopen(MY_HTTP_LOC)
    page = f.read()
    f.close()
    
    # extract relevant URL segments from source code
    rgxp = r'<td class="n"><a href="([0-9a-zA-Z()\-_.]+)"'
    files = re.findall(rgxp, page)
    
    # download (using aria2c) files
    for afile in files:
        if os.path.exists(afile) and not os.path.exists(afile+'.aria2'):
            print 'Skipping already-retrieved file: ' + afile
        else:
            print 'Downloading file: ' + afile          
            subprocess.Popen(["aria2c", "-x", "16", "-s", "20", MY_HTTP_LOC+str(afile)]).wait()
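
    If you only need the single big file from the question, the same aria2c flags can be invoked directly (a minimal sketch reusing the subprocess import above; the URL is the placeholder from the question):

    subprocess.call(["aria2c", "-x", "16", "-s", "20",
                     "http://bigfile.com/bigfile.bin"])  # up to 16 connections, split into 20 pieces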
    
  • 2020-12-05 16:20

    You can use the HTTP Range header to fetch just part of a file (already covered for Python here).

    Just start several threads, fetch a different range with each, and you're done ;)

    import threading
    import urllib2

    url = 'http://www.python.org/'  # example URL from the original snippet
    chunk_size = 1 << 20            # bytes fetched per request

    def download(url, start):
        req = urllib2.Request(url)
        req.add_header('Range', 'bytes=%s-%s' % (start, start + chunk_size - 1))
        f = urllib2.urlopen(req)
        parts[start] = f.read()

    threads = []
    parts = {}

    # Initialize threads, each fetching its own byte range
    for i in range(0, 10):
        t = threading.Thread(target=download, args=(url, i * chunk_size))
        t.start()
        threads.append(t)

    # Join threads back (order doesn't matter, you just want them all)
    for i in threads:
        i.join()

    # Sort parts by starting offset and concatenate
    result = ''.join(parts[i] for i in sorted(parts.keys()))
    

    Also note that not every server supports the Range header (servers where a PHP script is responsible for serving the data, in particular, often don't implement it).
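
    Since the question uses the Requests library, here is a minimal sketch of the same Range-plus-threads idea built on requests and a thread pool (the URL is the placeholder from the question; the chunk size, the pool size, and the reliance on a Content-Length header are assumptions):

    import requests
    from multiprocessing.dummy import Pool  # thread pool, not processes

    url = "http://bigfile.com/bigfile.bin"  # placeholder URL from the question
    chunk_size = 1 << 20                    # 1 MiB per request

    def fetch(byterange):
        start, end = byterange
        headers = {'Range': 'bytes=%d-%d' % (start, end)}
        return requests.get(url, headers=headers).content

    # total size taken from a HEAD request; assumes the server reports Content-Length
    total = int(requests.head(url).headers['Content-Length'])
    ranges = [(start, min(start + chunk_size - 1, total - 1))
              for start in range(0, total, chunk_size)]

    parts = Pool(10).map(fetch, ranges)     # 10 concurrent connections
    data = b''.join(parts)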

  • 2020-12-05 16:32

    Here's a Python script that saves a given URL to a file, using multiple threads to download it:

    #!/usr/bin/env python
    import sys
    from functools import partial
    from itertools import count, izip
    from multiprocessing.dummy import Pool # use threads
    from urllib2 import HTTPError, Request, urlopen
    
    def download_chunk(url, byterange):
        req = Request(url, headers=dict(Range='bytes=%d-%d' % byterange))
        try:
            return urlopen(req).read()
        except HTTPError as e:
            return b'' if e.code == 416 else None  # treat a range error as EOF
        except EnvironmentError:
            return None
    
    def main():
        url, filename = sys.argv[1:]
        pool = Pool(4) # define number of concurrent connections
        chunksize = 1 << 16
        ranges = izip(count(0, chunksize), count(chunksize - 1, chunksize))
        with open(filename, 'wb') as file:
            for s in pool.imap(partial(download_chunk, url), ranges):
                if not s:
                    break # error or EOF
                file.write(s)
                if len(s) != chunksize:
                    break  # EOF (servers with no Range support end up here)
    
    if __name__ == "__main__":
        main()
    

    The end of the file is detected when the server returns an empty body or a 416 HTTP code, or when the response size is not exactly chunksize.

    It supports servers that don't understand the Range header (in that case everything is downloaded in a single request; to support large files, change download_chunk() to save to a temporary file and return the filename, to be read in the main thread, instead of the file content itself; a rough sketch of that variant follows).
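
    For illustration, assuming the same Request/urlopen/HTTPError imports as the script above (the helper names below are made up; main() would then copy each returned temporary file into the output and base its EOF check on the file's size rather than len(s)):

    import os
    import shutil
    import tempfile

    def download_chunk_to_tempfile(url, byterange):
        req = Request(url, headers=dict(Range='bytes=%d-%d' % byterange))
        try:
            response = urlopen(req)
        except HTTPError as e:
            return '' if e.code == 416 else None  # '' still signals EOF to the caller
        except EnvironmentError:
            return None
        # stream the body to disk so a huge response never sits in memory
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            shutil.copyfileobj(response, tmp)
            return tmp.name

    def append_part(outfile, partname):
        # called from the main thread: append the chunk, then drop the temporary file
        with open(partname, 'rb') as part:
            shutil.copyfileobj(part, outfile)
        size = os.path.getsize(partname)
        os.remove(partname)
        return size  # compare against chunksize to detect EOF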

    It lets you change the number of concurrent connections (the pool size) and the number of bytes requested in a single HTTP request independently of each other.

    To use multiple processes instead of threads, change the import:

    from multiprocessing.pool import Pool # use processes (other code unchanged)
    