Question
The lxml html5parser seems to ignore any namespaceHTMLElements=False option I pass to it. It puts all elements I give it into the HTML namespace instead of the (expected) void namespace.
Here’s a simple case that reproduces the problem:
echo "<p>" | python -c "from sys import stdin; \
from lxml.html import html5parser as h5, tostring; \
print tostring(h5.parse(stdin, h5.HTMLParser(namespaceHTMLElements=False)))"
The output from that is this:
<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head></html:head><html:body><html:p>
</html:p></html:body></html:html>
As can be seen, the html element and all other elements there are in the HTML namespace.
The expected output is instead this:
<html><head></head><body><p>
</p></body></html>
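To make the difference between the two outputs concrete, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of lxml (an assumption, chosen only so the snippet has no third-party dependencies; the build helper and the variable names are made up for illustration). It builds the same tiny tree twice, once with tags in the HTML namespace and once in no namespace, and serializes both.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Ask the serializer to use the "html" prefix for the XHTML namespace.
ET.register_namespace("html", XHTML_NS)


def build(namespaced):
    """Build html/head/body/p, optionally placing every tag in XHTML_NS."""
    def tag(name):
        # ElementTree stores namespaced tags as "{uri}localname".
        return "{%s}%s" % (XHTML_NS, name) if namespaced else name

    root = ET.Element(tag("html"))
    ET.SubElement(root, tag("head"))
    body = ET.SubElement(root, tag("body"))
    ET.SubElement(body, tag("p"))
    return root


namespaced_root = build(True)
plain_root = build(False)

# Serialized forms: prefixed tags plus an xmlns declaration vs. bare tags.
namespaced_markup = ET.tostring(namespaced_root, encoding="unicode")
plain_markup = ET.tostring(plain_root, encoding="unicode")

print(namespaced_markup)
print(plain_markup)
```

The first print shows html:-prefixed tags with an xmlns:html declaration on the root, as in the output I'm getting; the second shows bare tags, as in the output I expected.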
I recognize that namespaceHTMLElements is an html5lib option, not a native lxml option that lxml itself acts on directly. lxml is supposed to just call html5lib and pass the option through in such a way that html5lib uses it as expected.
Update 2016-02-17
I still haven’t found a way to get the lxml html5parser to honor the namespaceHTMLElements option. But to be clear, the alternative is to call html5lib directly, like this:
echo "<p>" | python -c "from sys import stdin; \
import html5lib; from lxml import html; \
doc = html5lib.parse(stdin, treebuilder='lxml', namespaceHTMLElements=False); \
print html.tostring(doc)"
More details
Some things I already know:
- html5lib fully conforms to the requirements of the HTML spec, including the requirement that the html element must be placed into the HTML namespace, which html5lib does.
- However, html5lib provides namespaceHTMLElements=False as an option to override that default “put the html element into the HTML namespace” behavior.
- When I use html5lib directly (not through lxml) and pass namespaceHTMLElements=False to it, everything works as expected: the html element goes into the void namespace.
Hacking some printf into the html5lib sources, I observe that:
- lxml is actually calling html5lib with namespaceHTMLElements=False, as expected
- but lxml seems to be calling into html5lib twice: first without namespaceHTMLElements, then a second time with namespaceHTMLElements=False
Conclusion about where the cause is to be found
Given the above, it’s clear that the problem is in the interface between lxml and html5lib. I’m not sure why lxml is calling into html5lib twice, but I think it may be because for some reason it first tries to create a new instance of its own XHTMLParser before doing what I’m actually asking it to do, which is just to create an instance of its own HTMLParser.
So maybe the fact that it makes two calls to html5lib causes html5lib to sort of “lock in” the default namespaceHTMLElements=True behavior that results from the first call, and then ignore the namespaceHTMLElements=False directive when it sees it in the second call.
Maybe in making two calls the way it does, lxml is either breaking some assumption in html5lib, or is actually misusing the html5lib API in a way it was not designed to be used.
Or maybe the cause isn’t at all the result of lxml making two separate calls to html5lib, but instead some other problem in the way it’s using the html5lib interface.
Anyway, I’m interested in hearing whether anybody else has run into this problem and has a workaround, or at least has some insight into why it’s happening.
Answer 1:
I traced through the source code to see how lxml hands parameters to html5lib. Most of the functions end with a **kws parameter, which is then handed on to the next function. In one of the last steps, when calling the actual html5 parser, it is dropped and the parser is called with 2 fixed params.
(I had the same problem yesterday and only just came across this question, and I have forgotten the tiny details, so allow me to forgo code snippets and references.)
Anyway, this confirms that in 2018, calling html5lib directly is still the preferred way, if calling lxml's own parser is not an option for some reason.
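As a purely hypothetical illustration of that kind of bug (every name below is invented; this is not lxml or html5lib source), here is a minimal sketch of a call chain where keyword arguments are accepted at each level but silently dropped at the last hop, next to a variant that forwards them.

```python
# Hypothetical names throughout; this is NOT lxml or html5lib code.

def low_level_parse(source, tree_name, namespace_elements=True):
    """Stand-in for the final parser; namespaces elements by default."""
    prefix = "html:" if namespace_elements else ""
    return "<%sp>%s</%sp>" % (prefix, source, prefix)


def middle_layer(source, **kws):
    # The bug being illustrated: **kws is accepted but never forwarded,
    # so the final call is made with two fixed arguments only.
    return low_level_parse(source, "lxml")


def middle_layer_fixed(source, **kws):
    # Forwarding **kws lets the caller's option reach the parser.
    return low_level_parse(source, "lxml", **kws)


def public_parse(source, forward=middle_layer, **kws):
    """Entry point that passes any extra keyword arguments down the chain."""
    return forward(source, **kws)


dropped = public_parse("hi", namespace_elements=False)
forwarded = public_parse("hi", forward=middle_layer_fixed,
                         namespace_elements=False)

print(dropped)    # option lost on the way down, default applies
print(forwarded)  # option honored
```

In the dropped case the caller gets no error at all, which matches the symptom in the question: the option is accepted and then has no effect.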
(My use-case was: parse crappy html and have xpath.)
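To show why the namespace matters for that use case, here is a minimal sketch using the standard library's xml.etree.ElementTree and its limited XPath support (an assumption made to avoid third-party dependencies; strip_namespace is a made-up helper, not an lxml or html5lib function). A plain path does not match elements in the HTML namespace unless a prefix mapping is supplied or the namespace is stripped from the tags afterwards.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A small tree whose elements are all in the HTML namespace.
namespaced = ET.fromstring(
    '<html xmlns="%s"><head/><body><p>text</p></body></html>' % XHTML_NS
)

# A plain path finds nothing, because the real tag is "{uri}p".
plain_hits = namespaced.findall(".//p")

# Supplying a prefix mapping makes the query match.
mapped_hits = namespaced.findall(".//h:p", {"h": XHTML_NS})


def strip_namespace(root, uri):
    """Rewrite tags in place so elements in `uri` end up in no namespace."""
    marker = "{%s}" % uri
    for element in root.iter():
        # Comments and processing instructions have non-string tags.
        if isinstance(element.tag, str) and element.tag.startswith(marker):
            element.tag = element.tag[len(marker):]
    return root


# After stripping, the plain path works.
stripped_hits = strip_namespace(namespaced, XHTML_NS).findall(".//p")

print(len(plain_hits), len(mapped_hits), len(stripped_hits))
```

Stripping the namespace after parsing is only a fallback; getting the parser to build elements in the void namespace in the first place avoids the extra pass.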
Source: https://stackoverflow.com/questions/32731479/lxml-html5parser-ignores-namespacehtmlelements-false-option