html5lib

Web Crawler 05: The BeautifulSoup Library in Detail

 ̄綄美尐妖づ submitted on 2019-11-27 21:01:22
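BeautifulSoup

1. What is BeautifulSoup?
A flexible and convenient web page parsing library that is efficient and supports several parsers. With it you can extract information from web pages without writing regular expressions.

2. Installing BeautifulSoup
    pip3 install lxml
    pip3 install BeautifulSoup4

3. Parsers
Parser: Python standard library
    Usage: BeautifulSoup(markup, "html.parser")
    Advantages: built into Python, moderate speed, tolerant of malformed documents
    Disadvantages: poor document tolerance in versions before Python 2.7.3 or 3.2.2
Parser: lxml HTML parser
    Usage: BeautifulSoup(markup, "lxml")
    Advantages: fast, tolerant of malformed documents
    Disadvantages: requires the C library to be installed
Parser: lxml XML parser
    Usage: BeautifulSoup(markup, "xml")
    Advantages: fast, the only parser that supports XML
    Disadvantages: requires the C library to be installed
Parser: html5lib
    Usage: BeautifulSoup(markup, "html5lib")
    Advantages: best tolerance of malformed markup, parses documents the way a browser does, produces valid HTML5
    Disadvantages: slow; an external Python package (though it needs no C extension)

Basic usage
    html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name=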
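The excerpt above is cut off before its example finishes, so here is a minimal sketch of the same idea: pulling values out of a page without regular expressions. The markup in `SAMPLE_HTML` and the helper name `extract_info` are made up for illustration and are not the article's own code. It uses `html.parser` so that only `beautifulsoup4` needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration (not the article's document).
SAMPLE_HTML = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
</body></html>
"""


def extract_info(markup):
    """Return (title text, text of the first <p class="title">, list of link hrefs)."""
    soup = BeautifulSoup(markup, "html.parser")
    title = str(soup.title.string)                        # text inside <title>
    heading = soup.find("p", class_="title").get_text()   # text of the matching <p>
    links = [a["href"] for a in soup.find_all("a")]       # every href attribute
    return title, heading, links


title, heading, links = extract_info(SAMPLE_HTML)
print(title)
print(links)
```

Swapping `"html.parser"` for `"lxml"` or `"html5lib"` should leave the extracted values unchanged for well-formed input like this, provided the corresponding package is installed.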

Don't put html, head and body tags automatically, beautifulsoup

拜拜、爱过 submitted on 2019-11-27 14:17:51
When using BeautifulSoup with html5lib, it adds the html, head and body tags automatically:

    BeautifulSoup('<h1>FOO</h1>', 'html5lib')  # => <html><head></head><body><h1>FOO</h1></body></html>

Is there any option I can set to turn off this behavior?

    In [35]: import bs4 as bs
    In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
    Out[36]: <h1>FOO</h1>

This parses the HTML with Python's built-in HTML parser. Quoting the docs: Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn’t even bother to add an <html> tag.
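The difference between the two parsers can be checked side by side. The sketch below uses two hypothetical helpers, `render` and `body_fragment`, which are not part of BeautifulSoup. `render` returns `None` when the requested parser is not installed. `body_fragment` shows one possible workaround, assuming html5lib is still wanted for its leniency: parse with html5lib, then serialize only the children of `<body>`.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def render(markup, parser):
    """Serialize markup after parsing; None if the parser is unavailable."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # Raised by bs4 when the requested parser library is not installed.
        return None


def body_fragment(markup):
    """Parse with html5lib, then return only what ended up inside <body>."""
    soup = BeautifulSoup(markup, "html5lib")
    return soup.body.decode_contents()


print(render("<h1>FOO</h1>", "html.parser"))  # fragment is left as a fragment
print(render("<h1>FOO</h1>", "html5lib"))     # wrapped in html/head/body, or None
```

Note that `body_fragment` discards anything html5lib places in `<head>`, so it only suits fragments that belong in the body.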

beautifulsoup, html5lib: module object has no attribute _base

蹲街弑〆低调 submitted on 2019-11-27 10:25:19
Question
When I updated my packages I got this new error:

    class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder):
    AttributeError: 'module' object has no attribute '_base'

I tried updating beautifulsoup, but that did not help. How can I fix this?

Answer 1
I upgraded beautifulsoup4 and html5lib and it resolved the issue.

    pip install --upgrade beautifulsoup4
    pip install --upgrade html5lib

Answer 2
This is an issue with the upstream package html5lib: https://bugs.launchpad.net/beautifulsoup/+bug
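Since the fix in Answer 1 depends on which releases of the two packages are installed together, it can help to print both versions before and after upgrading. This is a small sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Report the two packages involved in the error above.
for name in ("beautifulsoup4", "html5lib"):
    print(name, installed_version(name))
```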

html5lib: TypeError: __init__() got an unexpected keyword argument 'encoding'

我是研究僧i submitted on 2019-11-26 22:05:17
Question
I'm trying to install html5lib. At first I tried to install the latest version (8 or 9 nines), but it conflicted with my BeautifulSoup, so I decided to try an older version (0.9999999, seven nines). I installed it, but when I try to use it:

    >>> with urlopen("http://example.com/") as f:
            document = html5lib.parse(f, encoding=f.info().get_content_charset())

I get an error:

    Traceback (most recent call last):
      File "<pyshell#11>", line 2, in <module>
        document = html5lib.parse(f, encoding=f
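The excerpt is cut off before any answer, so the following is only a sketch of one likely cause, not the thread's accepted fix. Newer html5lib releases (the 1.0 series) no longer accept an `encoding` keyword and instead take `transport_encoding` for a charset reported by the transport layer, such as an HTTP header. The example feeds in bytes directly rather than fetching a URL, and the helper name `parse_bytes` is hypothetical.

```python
import html5lib


def parse_bytes(data, charset):
    """Parse raw bytes, passing the transport-level charset to html5lib.

    Assumes an html5lib release that supports `transport_encoding`.
    """
    # namespaceHTMLElements=False keeps tag names plain ("p", not "{ns}p").
    return html5lib.parse(
        data,
        transport_encoding=charset,
        namespaceHTMLElements=False,
    )


# UTF-8 encoded bytes standing in for an HTTP response body.
document = parse_bytes("<p>café</p>".encode("utf-8"), "utf-8")
print(document.find(".//p").text)
```

With `urlopen`, the charset argument would come from `f.info().get_content_charset()`, as in the question.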