How do I download a file over HTTP using Python?

前端未结

关注

 25  3014

I have a small utility that I use to download an MP3 file from a website on a schedule and then builds/updates a podcast XML file which I\'ve added to iTunes.

The te

相关标签:

25条回答

甜味超标

2020-11-21 07:59

Source code can be:

import urllib
sock = urllib.urlopen("http://diveintopython.org/")
htmlSource = sock.read()                            
sock.close()                                        
print htmlSource

0 讨论(0)

时光取名叫无心

2020-11-21 08:01

Wrote wget library in pure Python just for this purpose. It is pumped up urlretrieve with these features as of version 2.0.

0 讨论(0)
发布评论:

提交评论
- 加载中...
别跟我提以往

2020-11-21 08:01
Late answer, but for python>=3.6 you can use:
```
import dload
dload.save(url)
```
Install dload with:
```
pip3 install dload
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

深忆病人

2020-11-21 08:02

You can get the progress feedback with urlretrieve as well:

def report(blocknr, blocksize, size):
    current = blocknr*blocksize
    sys.stdout.write("\r{0:.2f}%".format(100.0*current/size))

def downloadFile(url):
    print "\n",url
    fname = url.split('/')[-1]
    print fname
    urllib.urlretrieve(url, fname, report)

0 讨论(0)

别那么骄傲

2020-11-21 08:02

I wanted do download all the files from a webpage. I tried wget but it was failing so I decided for the Python route and I found this thread.

After reading it, I have made a little command line application, soupget, expanding on the excellent answers of PabloG and Stan and adding some useful options.

It uses BeatifulSoup to collect all the URLs of the page and then download the ones with the desired extension(s). Finally it can download multiple files in parallel.

Here it is:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from __future__ import (division, absolute_import, print_function, unicode_literals)
import sys, os, argparse
from bs4 import BeautifulSoup

# --- insert Stan's script here ---
# if sys.version_info >= (3,): 
#...
#...
# def download_file(url, dest=None): 
#...
#...

# --- new stuff ---
def collect_all_url(page_url, extensions):
    """
    Recovers all links in page_url checking for all the desired extensions
    """
    conn = urllib2.urlopen(page_url)
    html = conn.read()
    soup = BeautifulSoup(html, 'lxml')
    links = soup.find_all('a')

    results = []    
    for tag in links:
        link = tag.get('href', None)
        if link is not None: 
            for e in extensions:
                if e in link:
                    # Fallback for badly defined links
                    # checks for missing scheme or netloc
                    if bool(urlparse.urlparse(link).scheme) and bool(urlparse.urlparse(link).netloc):
                        results.append(link)
                    else:
                        new_url=urlparse.urljoin(page_url,link)                        
                        results.append(new_url)
    return results

if __name__ == "__main__":  # Only run if this file is called directly
    # Command line arguments
    parser = argparse.ArgumentParser(
        description='Download all files from a webpage.')
    parser.add_argument(
        '-u', '--url', 
        help='Page url to request')
    parser.add_argument(
        '-e', '--ext', 
        nargs='+',
        help='Extension(s) to find')    
    parser.add_argument(
        '-d', '--dest', 
        default=None,
        help='Destination where to save the files')
    parser.add_argument(
        '-p', '--par', 
        action='store_true', default=False, 
        help="Turns on parallel download")
    args = parser.parse_args()

    # Recover files to download
    all_links = collect_all_url(args.url, args.ext)

    # Download
    if not args.par:
        for l in all_links:
            try:
                filename = download_file(l, args.dest)
                print(l)
            except Exception as e:
                print("Error while downloading: {}".format(e))
    else:
        from multiprocessing.pool import ThreadPool
        results = ThreadPool(10).imap_unordered(
            lambda x: download_file(x, args.dest), all_links)
        for p in results:
            print(p)

An example of its usage is:

python3 soupget.py -p -e <list of extensions> -d <destination_folder> -u <target_webpage>

And an actual example if you want to see it in action:

python3 soupget.py -p -e .xlsx .pdf .csv -u https://healthdata.gov/dataset/chemicals-cosmetics

0 讨论(0)

佛祖请我去吃肉

2020-11-21 08:04
Use urllib.request.urlopen():
```
import urllib.request
with urllib.request.urlopen('http://www.example.com/') as f:
    html = f.read().decode('utf-8')
```
This is the most basic way to use the library, minus any error handling. You can also do more complex stuff such as changing headers.

On Python 2, the method is in urllib2:
```
import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
```
0 讨论(0)
发布评论:

提交评论
- 加载中...