Why do the following two code snippets give different outputs? The only difference between them is that the h1
tag in the first case is replaced with an
The short answer is that h1..h6 elements should not contain <p> in valid HTML documents, and lxml (which powers Scrapy Selectors) does not accept that nesting when parsing HTML. lxml does handle badly formed markup, but this case is a bit different.
You can check how lxml parses and serializes back the HTML snippet:
>>> from scrapy import Selector
>>> text = '<h1><p>xxx</p></h1>'
>>> s = Selector(text=text)
>>> print(s.extract())
<html><body><h1></h1><p>xxx</p></body></html>
So when lxml encounters the p tag within the h1, it closes the h1 and puts the p after it. The p element is not lost, but it's not where you'd expect it when reading the HTML source.
Compare that with the other snippet:
>>> text = '<h><p>xxx</p></h>'
>>> s = Selector(text=text)
>>> print(s.extract())
<html><body><h><p>xxx</p></h></body></html>
h elements do not mean anything special to lxml, so "p within h" is ok.
Including p tags inside h1..h6 elements is invalid according to the W3C HTML specification.
Anyway, to bypass this and work with any XML structure, you can change the selector type like this:
sel = Selector(text="anyxml", type="xml")
This parses the input as XML instead of HTML, so the nesting is kept as written.