Behavior of the scrapy xpath selector on h1-h6 tags

Why do the following two code snippets give different outputs? The only difference between them is that the h1 tag in the first case is replaced with an

2 answers

一个人的身影

2021-01-16 20:28
The short answer is that h1..h6 should not contain <p> in well-formed HTML documents, and lxml (which powers Scrapy Selectors) does not accept that nesting when parsing HTML. lxml does tolerate badly formatted markup, but this case is a bit different.

You can check how lxml parses and serializes back the HTML snippet:
```
>>> from scrapy import Selector
>>> text = '<h1><p>xxx</p></h1>'
>>> s = Selector(text=text)
>>> print(s.extract())
<html><body><h1></h1><p>xxx</p></body></html>
```
So when lxml encounters the p tag within the h1, it closes the h1 and places the p after it. The p element is not lost, but it's not where you'd expect it when reading the HTML source.

Compare that with the other snippet:
```
>>> text = '<h><p>xxx</p></h>'
>>> s = Selector(text=text)
>>> print(s.extract())
<html><body><h><p>xxx</p></h></body></html>
```
An h element has no special meaning to lxml, so "p within h" is left as it is.
余生分开走

2021-01-16 20:47
Nesting p tags inside h1-h6 elements is invalid HTML according to the W3C.

To bypass this and keep the structure exactly as written, you can change the selector type so the input is parsed as XML:
```
sel = Selector(text="anyxml", type="xml")
```
With type="xml" the input is parsed as XML rather than HTML, so the nesting in the source is preserved.