How to download a full webpage with a Python script?

星月不相逢 2020-12-09 18:36

Currently I have a script that can only download the HTML of a given page.

Now I want to download all the files the page references (images, CSS, JavaScript) as well, so that I end up with the complete page.

4 Answers
  • 2020-12-09 18:59

    You can do that easily with the pywebcopy library.

    For the current version (5.0.1):

    
    from pywebcopy import save_webpage
    
    url = 'http://some-site.com/some-page.html'
    download_folder = '/path/to/downloads/'    
    
    kwargs = {'bypass_robots': True, 'project_name': 'recognisable-name'}
    
    save_webpage(url, download_folder, **kwargs)
    
    

    You will have the HTML, CSS, and JS in your download_folder, working just like the original site.
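
    If you want to mirror an entire site rather than a single page, pywebcopy also provides a save_website helper. A minimal sketch, assuming it accepts the same arguments as save_webpage in this version:

    from pywebcopy import save_website

    url = 'http://some-site.com/'
    download_folder = '/path/to/downloads/'

    kwargs = {'bypass_robots': True, 'project_name': 'recognisable-name'}

    # crawls the site and saves every page plus its assets under download_folder
    save_website(url, download_folder, **kwargs)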

  • 2020-12-09 18:59

    Try the Scrapy framework. You can program Scrapy to crawl a website recursively: it downloads each page, scans it, and follows the links it finds:

    An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.
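
    A minimal sketch of such a spider (the start URL and the file-naming scheme are placeholders, not part of Scrapy itself):

    import scrapy

    class SiteSpider(scrapy.Spider):
        name = "site"
        start_urls = ["http://some-site.com/"]  # hypothetical starting page

        def parse(self, response):
            # save the raw HTML of every page visited
            name = response.url.rstrip("/").split("/")[-1] or "index"
            with open(name + ".html", "wb") as f:
                f.write(response.body)
            # follow every link on the page and parse it the same way
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

    Run it with scrapy runspider spider.py, or with scrapy crawl site inside a Scrapy project.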

  • 2020-12-09 19:03

    The following implementation enables you to collect the sub-HTML pages. It can be developed further to fetch the other files you need as well. I set the depth variable so you can choose the maximum number of levels of sub-pages you want to parse.

    import urllib2
    from BeautifulSoup import *
    from urlparse import urljoin
    
    
    def crawl(pages, depth=None):
        indexed_url = [] # a list for the main and sub-HTML websites in the main website
        for i in range(depth):
            for page in pages:
                if page not in indexed_url:
                    indexed_url.append(page)
                    try:
                        c = urllib2.urlopen(page)
                    except:
                        print "Could not open %s" % page
                        continue
                    soup = BeautifulSoup(c.read())
                    links = soup('a') #finding all the sub_links
                    for link in links:
                        if 'href' in dict(link.attrs):
                            url = urljoin(page, link['href'])
                            if url.find("'") != -1:
                                continue
                            url = url.split('#')[0]
                            if url[0:4] == 'http':
                                indexed_url.append(url)
            pages = indexed_url
        return indexed_url
    
    
    pagelist=["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
    urls = crawl(pagelist, depth=2)
    print urls
    

    Python 3 version, 2019. May this save somebody some time:

    #!/usr/bin/env python
    
    
    import urllib.request as urllib2
    from bs4 import BeautifulSoup
    from urllib.parse  import urljoin
    
    
    def crawl(pages, depth=None):
        indexed_url = [] # a list for the main and sub-HTML websites in the main website
        for i in range(depth):
            for page in pages:
                if page not in indexed_url:
                    indexed_url.append(page)
                    try:
                        c = urllib2.urlopen(page)
                    except:
                        print( "Could not open %s" % page)
                        continue
                    soup = BeautifulSoup(c.read(), 'html.parser')
                    links = soup('a') #finding all the sub_links
                    for link in links:
                        if 'href' in dict(link.attrs):
                            url = urljoin(page, link['href'])
                            if url.find("'") != -1:
                                continue
                            url = url.split('#')[0]
                            if url[0:4] == 'http':
                                indexed_url.append(url)
            pages = indexed_url
        return indexed_url
    
    
    pagelist=["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
    urls = crawl(pagelist, depth=1)
    print( urls )
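
    The crawler above only collects URLs. A small follow-up sketch (not part of the original answer) that saves the HTML of every collected page to disk, assuming Python 3 and the crawl() function above:

    import os
    import urllib.request
    from urllib.parse import urlparse

    def download_pages(urls, folder="downloaded_pages"):
        os.makedirs(folder, exist_ok=True)
        for url in urls:
            # derive a file name from the URL path, falling back to "index"
            name = os.path.basename(urlparse(url).path) or "index"
            path = os.path.join(folder, name + ".html")
            try:
                with urllib.request.urlopen(url) as resp, open(path, "wb") as f:
                    f.write(resp.read())
            except Exception as exc:
                print("Could not download %s: %s" % (url, exc))

    download_pages(urls)  # 'urls' comes from crawl() above

    Note that different URLs can map to the same file name here; a real implementation would de-duplicate or hash the names.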
    
  • 2020-12-09 19:10

    Using Python 3+, Requests, and the standard library.

    The function savePage receives a requests.Response and the pagefilename to save it as.

    • Saves pagefilename.html in the current folder.
    • Downloads the JavaScript, CSS, and images referenced by the script, link, and img tags and saves them in a folder named pagefilename_files.
    • Any exceptions are printed to sys.stderr; the function returns a BeautifulSoup object.
    • The Requests session must be a global variable unless someone writes cleaner code here for us.

    You can adapt it to your needs.


    import os, sys
    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup
    
    def soupfindAllnSave(pagefolder, url, soup, tag2find='img', inner='src'):
        if not os.path.exists(pagefolder): # create only once
            os.mkdir(pagefolder)
        for res in soup.findAll(tag2find):   # images, css, etc..
            try:
                filename = os.path.basename(res[inner])
                fileurl = urljoin(url, res.get(inner))
                filepath = os.path.join(pagefolder, filename)
                # point the tag at the local copy (a relative path inside pagefolder)
                res[inner] = os.path.join(os.path.basename(pagefolder), filename)
                if not os.path.isfile(filepath): # was not downloaded
                    with open(filepath, 'wb') as file:
                        filebin = session.get(fileurl)
                        file.write(filebin.content)
            except Exception as exc:      
                print(exc, file=sys.stderr)
        return soup
    
    def savePage(response, pagefilename='page'):
        url = response.url
        soup = BeautifulSoup(response.text, 'html.parser')
        pagefolder = pagefilename + '_files' # folder for the page contents
        soup = soupfindAllnSave(pagefolder, url, soup, 'img', inner='src')
        soup = soupfindAllnSave(pagefolder, url, soup, 'link', inner='href')
        soup = soupfindAllnSave(pagefolder, url, soup, 'script', inner='src')
        with open(pagefilename + '.html', 'w', encoding='utf-8') as file:
            file.write(soup.prettify())
        return soup
    

    Example: saving the Google page and its contents (into the google_files folder):

    session = requests.Session()
    #... whatever requests config you need here
    response = session.get('https://www.google.com')
    savePage(response, 'google')
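
    If you prefer a single entry point, here is a small wrapper sketch (hypothetical, not part of the original answer) that creates the global session the helpers above rely on and saves the page in one call:

    def save_full_page(page_url, name):
        global session
        session = requests.Session()   # soupfindAllnSave reads this module-level session
        response = session.get(page_url)
        return savePage(response, name)

    save_full_page('https://www.google.com', 'google')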
    