How do I download a file over HTTP using Python?

感动是毒 2020-11-21 07:17

I have a small utility that I use to download an MP3 file from a website on a schedule and then build/update a podcast XML file which I've added to iTunes.

The text processing that creates/updates the XML file is written in Python. However, I use wget inside a Windows .bat file to download the actual MP3 file. I would prefer to have the entire utility written in Python. I struggled to find a way to actually download the file in Python, which is why I resorted to using wget. So, how do I download the file using Python?

25 answers
  • 2020-11-21 07:59

    On Python 2, the source code can be:

    import urllib

    # open the URL, read the whole response body, then close the connection
    sock = urllib.urlopen("http://diveintopython.org/")
    htmlSource = sock.read()
    sock.close()
    print htmlSource
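
    Since the question is about saving a file rather than printing it, a small Python 2 extension of this sketch writes the response body to disk; the output filename is only an illustration:

    import urllib

    # fetch the page and write the raw bytes to a local file (Python 2)
    sock = urllib.urlopen("http://diveintopython.org/")
    with open("index.html", "wb") as out:  # hypothetical output filename
        out.write(sock.read())
    sock.close()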
    
  • 2020-11-21 08:01

    I wrote the wget library in pure Python just for this purpose. As of version 2.0, it is urlretrieve pumped up with additional features.
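
    A minimal usage sketch, assuming the wget package from PyPI (pip install wget); the URL is only a placeholder:

    import wget

    # downloads into the current directory and returns the local filename
    filename = wget.download("http://www.example.com/podcast.mp3")
    print(filename)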

  • 2020-11-21 08:01

    A late answer, but for Python >= 3.6 you can use:

    import dload
    dload.save(url)
    

    Install dload with:

    pip3 install dload
    
  • 2020-11-21 08:02

    You can get progress feedback with urlretrieve as well:

    import sys
    import urllib  # Python 2; on Python 3, urlretrieve lives in urllib.request

    def report(blocknr, blocksize, size):
        # reporthook callback: print download progress as a percentage
        current = blocknr * blocksize
        if size > 0:  # size can be -1 when the server sends no Content-Length
            sys.stdout.write("\r{0:.2f}%".format(100.0 * current / size))

    def downloadFile(url):
        print "\n", url
        fname = url.split('/')[-1]   # use the last path segment as the filename
        print fname
        urllib.urlretrieve(url, fname, report)
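
    On Python 3, the same approach works once the import is updated, since urlretrieve moved to urllib.request; a minimal sketch under that assumption (the URL and filename are placeholders):

    import sys
    import urllib.request

    def report(blocknr, blocksize, size):
        # reporthook callback: print download progress as a percentage
        current = blocknr * blocksize
        if size > 0:
            sys.stdout.write("\r{0:.2f}%".format(100.0 * current / size))

    # the reporthook API is unchanged in Python 3
    urllib.request.urlretrieve("http://www.example.com/episode42.mp3",
                               "episode42.mp3", report)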
    
  • 2020-11-21 08:02

    I wanted to download all the files from a webpage. I tried wget but it was failing, so I decided to go the Python route and found this thread.

    After reading it, I made a little command-line application, soupget, expanding on the excellent answers of PabloG and Stan and adding some useful options.

    It uses BeautifulSoup to collect all the URLs on the page and then downloads the ones with the desired extension(s). It can also download multiple files in parallel.

    Here it is:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    from __future__ import (division, absolute_import, print_function, unicode_literals)
    import sys, os, argparse
    from bs4 import BeautifulSoup
    
    # --- insert Stan's script here ---
    # (it must provide the names used below: urllib2 and urlparse, which on
    #  Python 3 are aliased to urllib.request and urllib.parse, plus the
    #  download_file helper)
    # if sys.version_info >= (3,):
    # ...
    # def download_file(url, dest=None):
    # ...
    
    # --- new stuff ---
    def collect_all_url(page_url, extensions):
        """
        Recovers all links in page_url checking for all the desired extensions
        """
        conn = urllib2.urlopen(page_url)
        html = conn.read()
        soup = BeautifulSoup(html, 'lxml')
        links = soup.find_all('a')
    
        results = []    
        for tag in links:
            link = tag.get('href', None)
            if link is not None: 
                for e in extensions:
                    if e in link:
                        # Fallback for badly defined links
                        # checks for missing scheme or netloc
                        if bool(urlparse.urlparse(link).scheme) and bool(urlparse.urlparse(link).netloc):
                            results.append(link)
                        else:
                            new_url=urlparse.urljoin(page_url,link)                        
                            results.append(new_url)
        return results
    
    if __name__ == "__main__":  # Only run if this file is called directly
        # Command line arguments
        parser = argparse.ArgumentParser(
            description='Download all files from a webpage.')
        parser.add_argument(
            '-u', '--url', 
            help='Page url to request')
        parser.add_argument(
            '-e', '--ext', 
            nargs='+',
            help='Extension(s) to find')    
        parser.add_argument(
            '-d', '--dest', 
            default=None,
            help='Destination where to save the files')
        parser.add_argument(
            '-p', '--par', 
            action='store_true', default=False, 
            help="Turns on parallel download")
        args = parser.parse_args()
    
        # Recover files to download
        all_links = collect_all_url(args.url, args.ext)
    
        # Download
        if not args.par:
            for l in all_links:
                try:
                    filename = download_file(l, args.dest)
                    print(l)
                except Exception as e:
                    print("Error while downloading: {}".format(e))
        else:
            from multiprocessing.pool import ThreadPool
            results = ThreadPool(10).imap_unordered(
                lambda x: download_file(x, args.dest), all_links)
            for p in results:
                print(p)
    

    An example of its usage is:

    python3 soupget.py -p -e <list of extensions> -d <destination_folder> -u <target_webpage>
    

    And an actual example if you want to see it in action:

    python3 soupget.py -p -e .xlsx .pdf .csv -u https://healthdata.gov/dataset/chemicals-cosmetics
    
  • 2020-11-21 08:04

    Use urllib.request.urlopen():

    import urllib.request
    with urllib.request.urlopen('http://www.example.com/') as f:
        html = f.read().decode('utf-8')
    

    This is the most basic way to use the library, minus any error handling. You can also do more complex stuff such as changing headers.
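
    For example, a hedged sketch of setting a custom User-Agent header and streaming the response to a local file (the URL, header value, and output filename are placeholders):

    import shutil
    import urllib.request

    # build a request with a custom User-Agent header
    req = urllib.request.Request(
        'http://www.example.com/file.mp3',
        headers={'User-Agent': 'my-downloader/1.0'})

    # stream the response body to disk without loading it all into memory
    with urllib.request.urlopen(req) as response, open('file.mp3', 'wb') as out:
        shutil.copyfileobj(response, out)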

    On Python 2, the method is in urllib2:

    import urllib2
    response = urllib2.urlopen('http://www.example.com/')
    html = response.read()
    