Behavior of the scrapy xpath selector on h1-h6 tags

逝去的感伤 2021-01-16 19:59

Why do the following two code snippets give different outputs? The only difference between them is that the h1 tag in the first case is replaced with an

2 answers
  • 2021-01-16 20:28

    The short answer is that h1 to h6 elements should not contain <p> in valid HTML documents, and lxml (which powers Scrapy Selectors) does not keep that nesting when it parses HTML. lxml does handle badly formed markup, but this case is a bit different.

    You can check how lxml parses and serializes back the HTML snippet:

    >>> from scrapy import Selector
    >>> text = '<h1><p>xxx</p></h1>'
    >>> s = Selector(text=text)
    >>> print(s.extract())
    <html><body><h1></h1><p>xxx</p></body></html>
    

    So when lxml encounters the p tag inside the h1, it closes the h1 and places the p after it. The p element is not lost, but it is not where you would expect it from reading the HTML source.

    Compare that with the other snippet:

    >>> text = '<h><p>xxx</p></h>'
    >>> s = Selector(text=text)
    >>> print(s.extract())
    <html><body><h><p>xxx</p></h></body></html>
    >>> 
    

    An h element has no special meaning to lxml, so "p within h" is left as written.

  • 2021-01-16 20:47

    Including p tags inside h1-h6 is invalid according to the W3C. You can see more about this here

    To bypass this and work with any XML structure, you can change the selector type like this:

    from scrapy import Selector
    sel = Selector(text="anyxml", type="xml")
    

    With type="xml" the text is parsed as XML rather than HTML, so the nesting in the source is preserved.
