How do I download a file over HTTP using Python?


I have a small utility that I use to download an MP3 file from a website on a schedule and then build/update a podcast XML file which I've added to iTunes.

The text processing that creates/updates the XML file is written in Python, but I currently use wget inside a Windows .bat file to download the actual MP3 file. I would prefer to have the entire utility written in Python. How do I actually download the file over HTTP in Python?

25 answers
  • 2020-11-21 07:40

    In 2012, use the Python requests library:

    >>> import requests
    >>> 
    >>> url = "http://download.thinkbroadband.com/10MB.zip"
    >>> r = requests.get(url)
    >>> print(len(r.content))
    10485760
    

    You can run pip install requests to get it.

    Requests has many advantages over the alternatives because its API is much simpler. This is especially true if you have to do authentication, where urllib and urllib2 are unintuitive and painful to use.
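
    As a rough sketch of that point (the URL and credentials below are placeholders, not from the original answer), an authenticated download and save to disk looks like this:

    import requests

    # Hypothetical URL and credentials, shown only to illustrate the API
    url = "http://example.com/protected/10MB.zip"
    r = requests.get(url, auth=("user", "password"))
    r.raise_for_status()  # raise an exception on HTTP errors

    with open("10MB.zip", "wb") as f:
        f.write(r.content)  # fine for small files; stream large ones as shown further down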


    Update, 2015-12-30:

    People have expressed admiration for the progress bar. It's cool, sure. There are several off-the-shelf solutions now, including tqdm:

    from tqdm import tqdm
    import requests

    url = "http://download.thinkbroadband.com/10MB.zip"
    # stream=True keeps the response from being read into memory all at once
    response = requests.get(url, stream=True)

    with open("10MB", "wb") as handle:
        # tqdm wraps the chunk iterator to display progress as chunks arrive
        for data in tqdm(response.iter_content(chunk_size=1024)):
            handle.write(data)
    

    This is essentially the implementation @kvance described 30 months ago.

  • 2020-11-21 07:40

    The following are the most commonly used calls for downloading files in Python:

    1. urllib.urlretrieve ('url_to_file', file_name)

    2. urllib2.urlopen('url_to_file')

    3. requests.get(url)

    4. wget.download('url', file_name)

    Note: urlopen and urlretrieve are found to perform relatively poorly when downloading large files (size > 500 MB). requests.get stores the whole file in memory until the download is complete, unless you stream it as sketched below.
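
    A minimal streamed-download sketch with requests (reusing the example URL from the answers above), so the file is written chunk by chunk instead of being held in memory:

    import requests

    url = "http://download.thinkbroadband.com/10MB.zip"  # example URL standing in for a large file

    # stream=True defers fetching the body; iter_content yields it in chunks
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open("10MB.zip", "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)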

  • 2020-11-21 07:40

    If speed matters to you, I ran a small performance test of the urllib and wget modules; for wget I tried once with a status bar and once without. I used three different 500 MB files (different files each time, to rule out any caching going on under the hood). Tested on a Debian machine, with Python 2.

    First, these are the results (they are similar across runs):

    $ python wget_test.py 
    urlretrive_test : starting
    urlretrive_test : 6.56
    ==============
    wget_no_bar_test : starting
    wget_no_bar_test : 7.20
    ==============
    wget_with_bar_test : starting
    100% [......................................................................] 541335552 / 541335552
    wget_with_bar_test : 50.49
    ==============
    

    The test times each call with a "profile" decorator. This is the full code:

    import wget
    import urllib
    import time
    from functools import wraps
    
    def profile(func):
        @wraps(func)
        def inner(*args):
            print func.__name__, ": starting"
            start = time.time()
            ret = func(*args)
            end = time.time()
            print func.__name__, ": {:.2f}".format(end - start)
            return ret
        return inner
    
    # Placeholder URLs: point these at three real 500 MB files on your own server
    url1 = 'http://host.com/500a.iso'
    url2 = 'http://host.com/500b.iso'
    url3 = 'http://host.com/500c.iso'
    
    # Progress-bar callback that prints nothing, to disable wget's status bar
    def do_nothing(*args):
        pass
    
    @profile
    def urlretrive_test(url):
        return urllib.urlretrieve(url)
    
    @profile
    def wget_no_bar_test(url):
        return wget.download(url, out='/tmp/', bar=do_nothing)
    
    @profile
    def wget_with_bar_test(url):
        return wget.download(url, out='/tmp/')
    
    urlretrive_test(url1)
    print '=============='
    time.sleep(1)
    
    wget_no_bar_test(url2)
    print '=============='
    time.sleep(1)
    
    wget_with_bar_test(url3)
    print '=============='
    time.sleep(1)
    

    urllib seems to be the fastest, and the numbers suggest that wget's status bar is responsible for most of its overhead.

  • 2020-11-21 07:41

    Use the wget module:

    import wget
    wget.download('http://download.thinkbroadband.com/10MB.zip')  # example URL; pass any direct file URL
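
    If you want to control the local filename, wget.download also accepts an out argument (the same parameter the timing answer above uses); a small sketch, with an assumed filename:

    import wget

    url = 'http://download.thinkbroadband.com/10MB.zip'  # example URL
    # out may be a filename or a directory; the call returns the path it wrote
    filename = wget.download(url, out='10MB_copy.zip')
    print(filename)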
    
  • 2020-11-21 07:41

    You can use PycURL on Python 2 and 3.

    import pycurl
    
    FILE_DEST = 'pycurl.html'
    FILE_SRC = 'http://pycurl.io/'
    
    with open(FILE_DEST, 'wb') as f:
        c = pycurl.Curl()
        c.setopt(c.URL, FILE_SRC)
        c.setopt(c.WRITEDATA, f)
        c.perform()
        c.close()
    
  • 2020-11-21 07:45

    Just for the sake of completeness, it is also possible to call any program for retrieving files using the subprocess package. Programs dedicated to retrieving files are more powerful than Python functions like urlretrieve. For example, wget can download directories recursively (-r), can deal with FTP, redirects, and HTTP proxies, and can avoid re-downloading existing files (-nc); aria2 can do multi-connection downloads, which can potentially speed up your downloads (a sketch follows the wget example below).

    import subprocess
    subprocess.check_output(['wget', '-O', 'example_output_file.html', 'https://example.com'])
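
    Along the same lines, a sketch of driving aria2 through subprocess for a multi-connection download (the URL and output name are placeholders; -x sets the maximum number of connections per server and -o the output filename):

    import subprocess

    # Requires the aria2c binary on PATH; URL and filename are hypothetical
    subprocess.check_call([
        'aria2c',
        '-x', '16',                       # open up to 16 connections to the server
        '-o', 'example_output_file.iso',  # local filename to write
        'https://example.com/large_file.iso',
    ])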
    

    In Jupyter Notebook, one can also call programs directly with the ! syntax:

    !wget -O example_output_file.html https://example.com
    