Obtaining position info when parsing HTML in Python

Submitted by 依然范特西╮ on 2019-11-29 10:59:11

After some additional research and a more careful review of the html5lib source code, I discovered that html5lib.tokenizer.HTMLTokenizer does retain partial position information. By "partial," I mean that it knows the line and column of the last character of a given token. Unfortunately, it does not retain the position of the start of the token. I suppose the start could be extrapolated, but that feels like re-implementing much of the tokenizer in reverse. And no, using the end position of the previous token won't work if there is white space between tokens.

In any event, I was able to wrap the HTMLTokenizer and create an HTMLParser clone which mostly replicates the API. You can find my work here: https://gist.github.com/waylan/7d5b7552078f1abc6fac.

However, as the tokenizer is only part of the parsing process implemented by html5lib, we lose the good parts of html5lib. For example, no normalization has been done at that stage in the process, so you get the raw (potentially invalid) tokens rather than a normalized document. As stated in the comments there, it is not perfect and I question whether it is even useful.

In fact, I also discovered that the HTMLParser included in the Python standard library had been updated for Python 3.3 and no longer crashes hard on invalid input. As far as I can tell, it is better (for my use case) in that it provides genuinely useful position info (as it always has). In all other respects, it is no better or worse than my wrapper of html5lib (except, of course, that it has presumably received much more testing and is therefore more stable). Unfortunately, the update has not been back-ported to Python 2 or earlier Python 3 versions, although I don't imagine that would be all that difficult to do myself.

In any event, I've decided to move forward with HTMLParser in the standard library and reject my own wrapper around html5lib. You can see an early effort here which appears to work fine with minimal testing.
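
For illustration, here is a minimal sketch of the standard-library approach described above. It is not the code from the gist or the "early effort" mentioned in this answer. The class name PositionRecorder and the sample document are made up for the example, and it assumes Python 3, where the parser lives in html.parser.

```python
from html.parser import HTMLParser


class PositionRecorder(HTMLParser):
    """Hypothetical helper: record where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []  # list of (tag, line, column) tuples

    def handle_starttag(self, tag, attrs):
        # getpos() returns (lineno, offset) for the construct being
        # handled. Line numbers are 1-based, column offsets are 0-based.
        line, col = self.getpos()
        self.positions.append((tag, line, col))


# Deliberately sloppy markup: <p> and <b> are never closed.
doc = "<div>\n  <p>unclosed paragraph\n  <b>bold\n</div>"

parser = PositionRecorder()
parser.feed(doc)
parser.close()

for tag, line, col in parser.positions:
    print(f"<{tag}> starts at line {line}, column {col}")
```

Note that getpos() is called inside the handler, so it reports the start of the tag currently being handled, which is exactly the piece of information the html5lib tokenizer does not keep.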


According to the Beautiful Soup docs, HTMLParser was updated to support invalid input in Python 2.7.3 and 3.2.2, which is earlier than 3.3.

Only sort of an answer: html5lib doesn't provide a streaming API because, in general, it's impossible to stream while parsing HTML per the spec without either buffering or fatal errors. Consider the input <table>xxx, for example, where the spec requires the text xxx to be placed before the table element that has already been opened. It would, however, be nice to provide a streaming API for html5lib that used fatal errors only for those parse errors that prevent streaming. That would not be massively easy to implement, but not massively difficult either.

It shouldn't be too much work to get location info into the tree in html5lib (the fact that parse errors have location info makes it clear that it's possible to get!), and there are a couple of bugs filed on this, one general and one specific to lxml.

Note that it's not possible to use the html5lib tokenizer alone to achieve this, because the tree construction step changes the tokenizer's state at various points. To keep the tokenizer in the right state, you'd have to implement a minimal tree constructor (which would have to maintain a stack of open elements at least, though I think nothing more). Once you want to start filtering based on the current element, you basically need the whole tree construction step, so you're back to the streaming API problem above.

Interestingly, the HTMLParser class in the Python Standard Lib does offer support for obtaining the location info (with a getpos() method), but it is horrible at handling malformed HTML and has been eliminated as a possible solution.

A technique I've used before is to use BeautifulSoup.prettify() to fix up malformed HTML and then parse that with HTMLParser.
