Download HTML page and its contents

Asked by 闹比i on 2020-12-05 04:34

Does Python have any way of downloading an entire HTML page and its contents (images, CSS) to a local folder, given a URL? And then updating the local HTML file so it references the downloaded content locally?

4 Answers
  • 2020-12-05 04:40

    You can use urllib:

    import urllib.request
    
    opener = urllib.request.FancyURLopener({})
    url = "http://stackoverflow.com/"
    f = opener.open(url)
    content = f.read()
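
    Note that FancyURLopener is deprecated in Python 3. A minimal sketch of the same fetch using urllib.request.urlopen instead (assuming the URL is reachable):

    import urllib.request
    
    url = "http://stackoverflow.com/"
    with urllib.request.urlopen(url) as response:
        content = response.read()  # raw bytes of the HTML document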
    
  • 2020-12-05 04:47

    You can use the urllib module to download individual URLs but this will just return the data. It will not parse the HTML and automatically download things like CSS files and images.

    If you want to download the "whole" page you will need to parse the HTML and find the other things you need to download. You could use something like Beautiful Soup to parse the HTML you retrieve.

    This question has some sample code doing exactly that.
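
    A minimal sketch of that approach, assuming you only want the images referenced by img tags (the assets folder name is illustrative):

    import os
    import urllib.request
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup
    
    url = "http://stackoverflow.com/"
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    
    os.makedirs("assets", exist_ok=True)
    for img in soup.find_all("img", src=True):          # every <img> with a src attribute
        img_url = urljoin(url, img["src"])               # resolve relative URLs against the page
        name = os.path.basename(img_url).split("?")[0] or "image"
        urllib.request.urlretrieve(img_url, os.path.join("assets", name))

    The same loop works for CSS by iterating over link tags and reading their href attribute instead.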

  • 2020-12-05 04:47

    The function savePage below:

    • Saves the .html in the current folder
    • Downloads JavaScript, CSS and images based on the script, link and img tags
      • Saved in a folder with the suffix _files
    • Prints any exceptions to sys.stderr
    • Returns a BeautifulSoup object

    Uses Python 3+, Requests, BeautifulSoup and other standard libraries.

    The function savePage receives a url and a pagefilename telling it where to save the page.

    You can expand/adapt it to suit your needs.

    import os, sys
    import requests
    from urllib.parse import urljoin, urlparse
    from bs4 import BeautifulSoup
    import re
    
    def savePage(url, pagefilename='page'):
        def soupfindnSave(pagefolder, tag2find='img', inner='src'):
            """saves on specified `pagefolder` all tag2find objects"""
            if not os.path.exists(pagefolder): # create only once
                os.mkdir(pagefolder)
            for res in soup.findAll(tag2find):   # images, css, etc..
                try:         
                    if not res.has_attr(inner): # check if inner tag (file object) exists
                        continue # may or may not exist
                    filename = re.sub(r'\W+', '', os.path.basename(res[inner])) # strip special chars from the local filename
                    fileurl = urljoin(url, res.get(inner))
                    filepath = os.path.join(pagefolder, filename)
                    # rename html ref so can move html and folder of files anywhere
                    res[inner] = os.path.join(os.path.basename(pagefolder), filename)
                    if not os.path.isfile(filepath): # was not downloaded
                        with open(filepath, 'wb') as file:
                            filebin = session.get(fileurl)
                            file.write(filebin.content)
                except Exception as exc:
                    print(exc, file=sys.stderr)
            return soup
        
        session = requests.Session()
        #... whatever other requests config you need here
        response = session.get(url)
        soup = BeautifulSoup(response.text, features="lxml")
        pagefolder = pagefilename+'_files' # page contents
        soup = soupfindnSave(pagefolder, 'img', 'src')
        soup = soupfindnSave(pagefolder, 'link', 'href')
        soup = soupfindnSave(pagefolder, 'script', 'src')
        with open(pagefilename+'.html', 'wb') as file:
            file.write(soup.prettify('utf-8'))
        return soup
    

    Example: saving google.com as google.html, with its contents in the google_files folder (in the current directory).

    soup = savePage('https://www.google.com', 'google')
    
  • 2020-12-05 04:53

    What you're looking for is a mirroring tool. If you want one in Python, PyPI lists spider.py, but I have no experience with it. Others might be better, but I don't know - I use 'wget', which supports getting the CSS and the images. This probably does what you want (quoting from the manual):

    Retrieve only one HTML page, but make sure that all the elements needed for the page to be displayed, such as inline images and external style sheets, are also downloaded. Also make sure the downloaded page references the downloaded links.

    wget -p --convert-links http://www.server.com/dir/page.html
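
    If you would rather drive it from Python, here is a minimal sketch that shells out to wget via subprocess (assuming wget is installed and on your PATH):

    import subprocess
    
    url = "http://www.server.com/dir/page.html"
    # -p fetches the page requisites (images, CSS); --convert-links rewrites references to the local copies
    subprocess.run(["wget", "-p", "--convert-links", url], check=True)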
    