How to download a full webpage with a Python script?

前端未结

Currently I have a script that can only download the HTML of a given page.

Now I want to download all the files of the web page, not just the HTML.

4 Answers

被撕碎了的回忆

2020-12-09 18:59

You can do that easily with the pywebcopy library.

For the current version (5.0.1):

    from pywebcopy import save_webpage

    url = 'http://some-site.com/some-page.html'
    download_folder = '/path/to/downloads/'

    kwargs = {'bypass_robots': True, 'project_name': 'recognisable-name'}

    save_webpage(url, download_folder, **kwargs)

The HTML, CSS, and JS files will all be saved in your download_folder, and the local copy should work like the original site.

傲寒

2020-12-09 18:59

Try the Python library Scrapy. You can program Scrapy to crawl a website recursively: it downloads each page, scans it, and follows the links it finds. In the project's own words:

An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.

说谎

2020-12-09 19:03

The following implementation collects the pages linked from the start page (the sub-pages). It can be extended to fetch the other files you need. The depth variable sets how many levels of links you want to follow.

    import urllib2
    from BeautifulSoup import BeautifulSoup
    from urlparse import urljoin


    def crawl(pages, depth=1):
        indexed_url = []  # the start pages plus every sub-page found
        for i in range(depth):
            new_pages = []  # links discovered at this level
            for page in pages:
                if page in indexed_url:
                    continue
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except Exception:
                    print "Could not open %s" % page
                    continue
                soup = BeautifulSoup(c.read())
                links = soup('a')  # find all the sub-links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]  # drop the fragment
                        if url[0:4] == 'http':
                            new_pages.append(url)
            pages = new_pages  # the next level starts from the new links
        # record the links found at the last level without fetching them
        for url in pages:
            if url not in indexed_url:
                indexed_url.append(url)
        return indexed_url


    pagelist = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
    urls = crawl(pagelist, depth=2)
    print urls

Python 3 version (2019), in case it saves somebody some time:

    #!/usr/bin/env python3
    from urllib.request import urlopen
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup


    def crawl(pages, depth=1):
        indexed_url = []  # the start pages plus every sub-page found
        for i in range(depth):
            new_pages = []  # links discovered at this level
            for page in pages:
                if page in indexed_url:
                    continue
                indexed_url.append(page)
                try:
                    c = urlopen(page)
                except Exception:
                    print("Could not open %s" % page)
                    continue
                soup = BeautifulSoup(c.read(), 'html.parser')
                links = soup('a')  # find all the sub-links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]  # drop the fragment
                        if url[0:4] == 'http':
                            new_pages.append(url)
            pages = new_pages  # the next level starts from the new links
        # record the links found at the last level without fetching them
        for url in pages:
            if url not in indexed_url:
                indexed_url.append(url)
        return indexed_url


    pagelist = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
    urls = crawl(pagelist, depth=1)
    print(urls)
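For anyone who would rather avoid third-party packages, the same depth-limited idea can be sketched with only the standard library. This is an illustration under stated assumptions, not a drop-in replacement: LinkParser, extract_links, and the fetch parameter are names made up for this sketch, and the fetcher is passed in as an argument so the crawl logic can be tried on canned HTML without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return the absolute http(s) links in html, fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        url = urldefrag(urljoin(base_url, href)).url
        if url.startswith(("http://", "https://")) and url not in links:
            links.append(url)
    return links


def fetch(url):
    """Default fetcher: download a page and decode it as UTF-8 (assumed)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_urls, depth=1, fetch=fetch):
    """Breadth-first crawl that fetches pages fewer than `depth` links away."""
    seen = dict.fromkeys(start_urls)  # a dict keeps discovery order
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, level = queue.popleft()
        if level >= depth:
            continue  # recorded in the result, but not fetched
        try:
            html = fetch(url)
        except Exception as exc:
            print("Could not open %s: %s" % (url, exc))
            continue
        for link in extract_links(url, html):
            if link not in seen:
                seen[link] = None
                queue.append((link, level + 1))
    return list(seen)
```

With the default fetcher, crawl(pagelist, depth=1) downloads only the start pages and returns them followed by the links they contain. Each extra level of depth fetches the pages found at the previous level.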

庸人自扰

2020-12-09 19:10

Using Python 3 with Requests, BeautifulSoup, and the standard library.

The function savePage receives a requests.Response and the pagefilename under which to save it.

It saves pagefilename.html in the current folder.

It downloads the JavaScript, CSS, and image files referenced by the script, link, and img tags and saves them in a folder named pagefilename_files.

Any exceptions are printed to sys.stderr, and the function returns a BeautifulSoup object.

The Requests session must be a global variable, unless someone writes cleaner code here for us.

You can adapt it to your needs.

    import os, sys
    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup


    def soupfindAllnSave(pagefolder, url, soup, tag2find='img', inner='src'):
        if not os.path.exists(pagefolder):  # create only once
            os.mkdir(pagefolder)
        for res in soup.find_all(tag2find):  # images, css, etc.
            if not res.has_attr(inner):  # e.g. an inline <script> with no src
                continue
            try:
                filename = os.path.basename(res[inner])
                fileurl = urljoin(url, res.get(inner))
                filepath = os.path.join(pagefolder, filename)
                # point the tag at the saved copy
                res[inner] = os.path.join(os.path.basename(pagefolder), filename)
                if not os.path.isfile(filepath):  # not downloaded yet
                    filebin = session.get(fileurl)  # uses the global session
                    with open(filepath, 'wb') as file:
                        file.write(filebin.content)
            except Exception as exc:
                print(exc, file=sys.stderr)
        return soup


    def savePage(response, pagefilename='page'):
        url = response.url
        soup = BeautifulSoup(response.text, 'html.parser')
        pagefolder = pagefilename + '_files'  # folder for the page contents
        soup = soupfindAllnSave(pagefolder, url, soup, 'img', inner='src')
        soup = soupfindAllnSave(pagefolder, url, soup, 'link', inner='href')
        soup = soupfindAllnSave(pagefolder, url, soup, 'script', inner='src')
        with open(pagefilename + '.html', 'w', encoding='utf-8') as file:
            file.write(soup.prettify())
        return soup

Example: saving the Google home page and its contents (in the google_files folder):

    session = requests.Session()
    # ... whatever requests config you need here
    response = session.get('https://www.google.com')
    savePage(response, 'google')
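One thing to watch when adapting this: taking os.path.basename of the raw attribute value keeps any query string in the file name (for example style.css?v=3) and yields an empty name for references that end in a slash. A possible way to derive the local name is sketched below using only the standard library. The function local_asset_name and its default parameter are hypothetical names for this sketch, and tagging the name with a short hash of the query string is just one design choice among several.

```python
import hashlib
import posixpath
from urllib.parse import urljoin, urlsplit, unquote


def local_asset_name(page_url, ref, default="index"):
    """Map an asset reference to (absolute URL, local file name)."""
    absolute = urljoin(page_url, ref)
    parts = urlsplit(absolute)
    # Use only the path component, so the query string never leaks in.
    name = posixpath.basename(unquote(parts.path)) or default
    if parts.query:
        # Different query strings may be different files: add a short tag.
        digest = hashlib.sha1(parts.query.encode("utf-8")).hexdigest()[:8]
        stem, ext = posixpath.splitext(name)
        name = "%s_%s%s" % (stem, digest, ext)
    return absolute, name
```

Inside soupfindAllnSave, the returned pair could stand in for the separate fileurl and filename computations.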
