html5lib

How can I parse HTML with html5lib, and query the parsed HTML with XPath?

≡放荡痞女 submitted on 2019-12-18 11:46:48
Question: I am trying to use html5lib to parse an HTML page into something I can query with XPath. html5lib has close to zero documentation, and I've spent too much time trying to figure this problem out. The ultimate goal is to pull out the second row of a table:

    <html>
    <table>
    <tr><td>Header</td></tr>
    <tr><td>Want This</td></tr>
    </table>
    </html>

So let's try it:

    >>> doc = html5lib.parse('<html><table><tr><td>Header</td></tr><tr><td>Want This</td> </tr></table></html>', treebuilder='lxml')
    >>> doc
    <lxml
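The excerpt is cut off before any answer, so the following is only a minimal sketch of one way to reach the second row, assuming both html5lib and lxml are installed. It passes `namespaceHTMLElements=False` so that element names carry no XHTML namespace, and it accounts for the `<tbody>` element that the HTML5 parsing algorithm inserts inside `<table>`.

```python
import html5lib

HTML = (
    "<html><table>"
    "<tr><td>Header</td></tr>"
    "<tr><td>Want This</td></tr>"
    "</table></html>"
)

# namespaceHTMLElements=False keeps tag names plain ("tr" rather than
# "{http://www.w3.org/1999/xhtml}tr"), so XPath needs no namespace prefix.
doc = html5lib.parse(HTML, treebuilder="lxml", namespaceHTMLElements=False)

# The parser inserts <tbody>, so rows are not direct children of <table>.
# "//tr[2]" selects the second <tr> within its parent element.
second_row = doc.xpath("//table//tr[2]/td/text()")
print(second_row)
```

If the namespace is left enabled (the default), the same query would need a prefix mapping passed through the `namespaces` argument of `xpath()`.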

Don't put html, head and body tags automatically, beautifulsoup

倖福魔咒の submitted on 2019-12-17 11:24:11
Question: When using BeautifulSoup with html5lib, it adds the html, head and body tags automatically:

    BeautifulSoup('<h1>FOO</h1>', 'html5lib')
    # => <html><head></head><body><h1>FOO</h1></body></html>

Is there any option I can set to turn off this behavior?

Answer 1:

    In [35]: import bs4 as bs
    In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
    Out[36]: <h1>FOO</h1>

This parses the HTML with Python's builtin HTML parser. Quoting the docs: Unlike html5lib, this parser makes no attempt to create a well
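As a rough illustration of the two routes, the sketch below compares the builtin parser from the answer with an alternative that keeps html5lib and serializes only the contents of `<body>`. The second route is an assumption of this example rather than something the quoted answer proposes.

```python
from bs4 import BeautifulSoup

fragment = "<h1>FOO</h1>"

# Route 1 (from the answer): the builtin parser adds no wrapper elements.
with_builtin = str(BeautifulSoup(fragment, "html.parser"))

# Route 2 (assumption): keep html5lib's HTML5-compliant parsing, then
# serialize only what ended up inside <body>.
soup = BeautifulSoup(fragment, "html5lib")
body_only = soup.body.decode_contents()

print(with_builtin)
print(body_only)
```

Route 2 only recovers content that html5lib places in `<body>`; elements it moves into `<head>` (such as a leading `<style>` or `<title>`) would be left out.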

Can't open html5lib in Python

断了今生、忘了曾经 submitted on 2019-12-14 03:51:20
Question: I just installed html5lib for Python from the Windows Command Prompt. The package was installed here:

    File "C:\Python27\lib\site-packages\html5lib

However, if I try to import html5lib:

    #! /usr/bin/python
    import html5lib

I get the following error:

    Traceback (most recent call last):
      File "C:\Users\workspace\testhtml5\src\test.py", line 2, in <module>
        import html5lib
      File "C:\Python27\lib\site-packages\html5lib\__init__.py", line 16, in <module>
        from .html5parser import HTMLParser, parse,

Validate an HTML fragment using html5lib

…衆ロ難τιáo~ submitted on 2019-12-11 10:38:40
Question: I'm using Python and html5lib to check whether a bit of HTML code entered in a form field is valid. I tried the following code to test a valid fragment, but I'm getting an error that I did not expect:

    >>> import html5lib
    >>> from html5lib.filters import lint
    >>> fragment = html5lib.parseFragment('<p><script>alert("Boo!")</script></p>')
    >>> walker = html5lib.getTreeWalker('etree')
    >>> [i for i in lint.Filter(walker(fragment))]
    Traceback (most recent call last):
      File "<console>", line 1, in
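The traceback is truncated in this excerpt, so the sketch below does not diagnose the lint filter error. It shows a different technique instead: letting the parser itself collect parse errors through the `errors` list of `html5lib.HTMLParser`. The helper name `fragment_errors` is hypothetical.

```python
import html5lib


def fragment_errors(fragment):
    """Return the parse errors html5lib records for an HTML fragment."""
    # strict=False makes the parser record errors instead of raising on the first one.
    parser = html5lib.HTMLParser(strict=False)
    parser.parseFragment(fragment)
    # Each entry is a (position, error_code, details) tuple.
    return list(parser.errors)


for sample in ('<p><script>alert("Boo!")</script></p>', "<p><b>unclosed</p>"):
    print(sample, fragment_errors(sample))
```

Note that "no parse errors" is a narrower claim than "valid HTML": the parser reports syntax-level problems only, not content-model rules.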

incompatible numpy and html5lib for tensorflow

▼魔方 西西 submitted on 2019-12-11 04:47:41
Question:

    tensorflow 1.7.0 has requirement numpy>=1.13.3, but you'll have numpy 1.11.0 which is incompatible.
    tensorboard 1.7.0 has requirement html5lib==0.9999999, but you'll have html5lib 0.999 which is incompatible.
    tensorboard 1.7.0 has requirement numpy>=1.12.0, but you'll have numpy 1.11.0 which is incompatible.

Please refer to this screenshot.

Why are these messages showing up even though I have the proper versions installed? I have upgraded and reinstalled them over and over. I also
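One common reason for messages like these is that pip and the running interpreter look at different environments, although the excerpt does not confirm that this is the cause here. The sketch below, which assumes Python 3.8 or later for `importlib.metadata`, prints which interpreter is running and which versions it actually sees. The helper name `installed_version` is hypothetical.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the version string visible to this interpreter, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# The interpreter path shows which environment is being inspected.
print(sys.executable)
for name in ("numpy", "html5lib", "tensorflow", "tensorboard"):
    print(name, installed_version(name))
```

Running this with the same interpreter that launches TensorFlow, and comparing against the versions named in the pip messages, indicates whether the upgrades landed in the environment that is actually in use.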

Remove contents of <style>…</style> tags using html5lib or bleach

天涯浪子 submitted on 2019-12-10 10:37:15
Question: I've been using the excellent bleach library for removing bad HTML. I've got a load of HTML documents which have been pasted in from Microsoft Word, and contain things like:

    <STYLE> st1:*{behavior:url(#ieooui) } </STYLE>

Using bleach (with the style tag implicitly disallowed) leaves me with:

    st1:*{behavior:url(#ieooui) }

which isn't helpful. Bleach seems to have only two options: escape the tags, or remove the tags (but not their contents). I'm looking for a third option - remove the tags and their
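A minimal sketch of the "third option", assuming BeautifulSoup with the html5lib parser is acceptable as a pre-processing step before bleach: drop every `<style>` element together with its contents, then hand the remaining markup on for sanitizing. The helper name `strip_style` is hypothetical, and the bleach call itself is left out.

```python
from bs4 import BeautifulSoup


def strip_style(html):
    """Remove <style> elements and their contents, returning the body markup."""
    soup = BeautifulSoup(html, "html5lib")
    # html5lib lowercases tag names, so "style" also matches <STYLE>.
    for tag in soup.find_all("style"):
        tag.decompose()  # removes the element and everything inside it
    return soup.body.decode_contents()


cleaned = strip_style("<STYLE> st1:*{behavior:url(#ieooui) } </STYLE><p>Hello</p>")
print(cleaned)
```

The returned string can then be passed to bleach as before, since the stylesheet text is no longer present to leak through as plain text.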

parse any HTML to XML using html5lib

倾然丶 夕夏残阳落幕 submitted on 2019-12-08 17:20:43
I need to tidy up HTML pages and convert them to XML in Python; losing some "bad" parts if needed. I used TagSoup for some time, but it doesn't understand new "article", "footer" tags, and doesn't like "meta" when they are not in the head; making resulting XML almost impossible to process. I like what html5lib does so far, but my fifth test (very weird tests) failed; when parsing <div attr="val""> using html5lib + xml.dom treebuilder, I got the following in the resulting XML string: <div attr="val" "=""> which is not a good result for well-formed xml. When I tried html5lib + lxml as a
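For orientation, here is a minimal sketch of the html5lib plus lxml route the question is heading toward, assuming both libraries are installed. The helper name `html_to_xml` is hypothetical, and whether this resolves the stray-quote attribute case from the question is not verified here.

```python
import html5lib
from lxml import etree


def html_to_xml(html):
    """Parse HTML with html5lib into an lxml tree and serialize it as XML."""
    # namespaceHTMLElements=False keeps element names free of the XHTML namespace.
    tree = html5lib.parse(html, treebuilder="lxml", namespaceHTMLElements=False)
    return etree.tostring(tree, encoding="unicode")


xml_text = html_to_xml('<article><meta charset="utf-8"><p>hi</article>')
print(xml_text)

# Round-trip check: the output should be parseable by a strict XML parser.
root = etree.fromstring(xml_text)
print(root.tag)
```

The round-trip through `etree.fromstring` is a cheap way to confirm well-formedness for each test document.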

html5lib makes BeautifulSoup miss an element

倖福魔咒の submitted on 2019-12-08 08:43:02
Question: Continuing my attempt to pull transcripts from the Presidential debates, I've now started using html5lib as a parser with BeautifulSoup. But now, when I run (previously working) code to find the element with the actual transcript, it errors out and claims not to find any such span. Here's the code:

    from bs4 import BeautifulSoup
    import html5lib
    import urllib
    file = urllib.urlopen('http://www.presidency.ucsb.edu/ws/index.php?pid=111395')
    soup = BeautifulSoup(file, "html5lib")
    transcript = soup
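Since the lookup itself is cut off in the excerpt, the sketch below is only a debugging aid: it checks whether the same element is found under each parser for a given HTML string. The helper name `found_by_parser` and the `displaytext` class in the sample are hypothetical.

```python
from bs4 import BeautifulSoup


def found_by_parser(html, name, attrs, parsers=("html.parser", "html5lib")):
    """Report, per parser, whether find(name, attrs) locates an element."""
    return {
        parser: BeautifulSoup(html, parser).find(name, attrs=attrs) is not None
        for parser in parsers
    }


sample = '<div><span class="displaytext">Transcript text</span></div>'
result = found_by_parser(sample, "span", {"class": "displaytext"})
print(result)
```

Running it against the saved page source narrows the problem down: a difference between parsers points at how each one repairs the page's markup rather than at the lookup code.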

parse any HTML to XML using html5lib

岁酱吖の submitted on 2019-12-08 04:48:28
Question: I need to tidy up HTML pages and convert them to XML in Python; losing some "bad" parts if needed. I used TagSoup for some time, but it doesn't understand new "article", "footer" tags, and doesn't like "meta" when they are not in the head; making resulting XML almost impossible to process. I like what html5lib does so far, but my fifth test (very weird tests) failed; when parsing <div attr="val""> using html5lib + xml.dom treebuilder, I got the following in the resulting XML string: <div attr

How to remove namespace value from inside lxml.html.html5parser element tag

狂风中的少年 submitted on 2019-12-08 04:32:09
Question: Is it possible not to add the namespace to the tag when using html5parser from the lxml.html package? Example:

    from lxml import html
    print(html.parse('http://example.com').getroot().tag)
    # You will get 'html'

    from lxml.html import html5parser
    print(html5parser.parse('http://example.com').getroot().tag)
    # You will get '{http://www.w3.org/1999/xhtml}html'

The easiest solution I found is to remove it with a regex, but maybe it's possible not to include that text at all?

Answer 1: There is a specific
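The answer is truncated here, so the following is a sketch under the assumption that it refers to the `namespaceHTMLElements` option, which `lxml.html.html5parser.HTMLParser` forwards to html5lib. It parses an inline string rather than a URL so that it runs without network access.

```python
from lxml.html import html5parser

HTML = "<html><head><title>t</title></head><body><p>hi</p></body></html>"

# Default parser: tags carry the XHTML namespace.
default_root = html5parser.document_fromstring(HTML)

# Parser built with namespaceHTMLElements=False: tags are plain names.
plain_parser = html5parser.HTMLParser(namespaceHTMLElements=False)
plain_root = html5parser.document_fromstring(HTML, parser=plain_parser)

print(default_root.tag)
print(plain_root.tag)
```

The same `parser=` argument is accepted by `html5parser.parse()`, which is the call used in the question.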