Is it possible for Scrapy to get plain text from raw HTML data?

悲&欢浪女 2021-02-12 17:27

For example:

scrapy shell http://scrapy.org/
content = hxs.select('//*[@id="content"]').extract()[0]
print content

Then, I get the following...

3 Answers

星月不相逢 (original poster)

2021-02-12 18:09
Scrapy doesn't have such functionality built-in. html2text is what you are looking for.

Here's a sample spider that scrapes Wikipedia's Python page, selects the first paragraph with XPath, and converts its HTML into plain text using html2text:
```
import html2text
import scrapy


class WikiSpider(scrapy.Spider):
    name = "wiki_spider"
    # Must match the host in start_urls.
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        # Raw HTML of the first paragraph; .get() returns None if nothing matches.
        sample = response.xpath("//div[@id='mw-content-text']/p[1]").get()
        if sample is None:
            return

        converter = html2text.HTML2Text()
        converter.ignore_links = True  # keep link text, drop link targets
        print(converter.handle(sample))
```
prints:

**Python** is a widely used general-purpose, high-level programming language.[11][12][13] Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C.[14][15] The language provides constructs intended to enable clear programs on both a small and large scale.[16]
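
If installing html2text is not an option, the same idea can be roughly approximated with the standard library alone. The sketch below is not html2text and produces no Markdown formatting; it only strips tags and collapses whitespace. The names `TextExtractor` and `html_to_text` are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the collected text and collapse runs of whitespace.
        return " ".join("".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML fragment as a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()


if __name__ == "__main__":
    print(html_to_text("<p><b>Python</b> is a <a href='/x'>language</a>.</p>"))
```

Inside a spider, the extracted HTML string (`sample` in the answer above) could be passed to `html_to_text` in place of `converter.handle`.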