Is it possible for Scrapy to get plain text from raw HTML data?

悲&欢浪女 2021-02-12 17:27

For example:

scrapy shell http://scrapy.org/
content = hxs.select('//*[@id="content"]').extract()[0]
print content

Then, I get the followin

3 Answers
  • 2021-02-12 18:06

    You don't need to install a third-party library for this; Scrapy provides this functionality through selectors.
    Assume this selector over nested markup:

    from scrapy.selector import Selector

    sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
    

    We can get all of the text nodes with:

    text_content = sel.xpath("//a[1]//text()").extract()
    # which results in ['Click here to go to the ', 'Next Page']
    

    Then you can join them together easily. The first fragment already ends with a space, so join with an empty string:

    ''.join(text_content)
    # Click here to go to the Next Page
    
  • 2021-02-12 18:09

    Scrapy doesn't have such functionality built-in. html2text is what you are looking for.

    Here's a sample spider that scrapes Wikipedia's Python page, gets the first paragraph using XPath, and converts the HTML into plain text using html2text:

    import html2text
    import scrapy
    
    
    class WikiSpider(scrapy.Spider):
        name = "wiki_spider"
        allowed_domains = ["en.wikipedia.org"]
        start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]
    
        def parse(self, response):
            # Raw HTML of the first paragraph in the article body
            sample = response.xpath("//div[@id='mw-content-text']/p[1]").get()
            if sample is None:
                return  # the page layout did not match the XPath
    
            # Convert the HTML fragment to plain (Markdown-style) text
            converter = html2text.HTML2Text()
            converter.ignore_links = True
            print(converter.handle(sample))
    

    prints:

    **Python** is a widely used general-purpose, high-level programming language.[11][12][13] Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C.[14][15] The language provides constructs intended to enable clear programs on both a small and large scale.[16]

  • 2021-02-12 18:15

    Another solution is to use lxml.html's tostring() with the parameter method="text". Scrapy already uses lxml internally. (Passing encoding="unicode", or the unicode type on Python 2, is usually what you want, because it returns a text string instead of bytes.)

    See http://lxml.de/api/lxml.html-module.html for details.

    import lxml.etree
    import lxml.html
    import scrapy
    
    class WikiSpider(scrapy.Spider):
        name = "wiki_spider"
        allowed_domains = ["en.wikipedia.org"]
        start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]
    
        def parse(self, response):
            root = lxml.html.fromstring(response.body)
    
            # optionally remove nodes that browsers do not render as text:
            # comments, scripts, and the HEAD; append any other tag names you don't want.
            # with_tail=False keeps the text that follows a removed node.
            lxml.etree.strip_elements(root, lxml.etree.Comment, "script", "head", with_tail=False)
    
            # complete text
            print(lxml.html.tostring(root, method="text", encoding="unicode"))
    
            # or, same as in alecxe's html2text example spider,
            # pinpoint a part of the document using XPath
            # for p in root.xpath("//div[@id='mw-content-text']/p[1]"):
            #     print(lxml.html.tostring(p, method="text", encoding="unicode"))
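
To try the lxml approach without running a spider, here is a small sketch that applies the same steps to an in-memory HTML string. The `SAMPLE_HTML` document and the `html_to_text` helper are hypothetical names made up for this illustration, and the whitespace collapsing at the end is an extra step that the answer above does not perform.

```python
import lxml.etree
import lxml.html

# Hypothetical sample document, used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>Ignored title</title></head>
  <body>
    <!-- a comment -->
    <p>Hello <b>plain</b> text.</p>
    <script>var hidden = 1;</script>
  </body>
</html>
"""


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    root = lxml.html.fromstring(html)
    # Drop comments, scripts, and the head, but keep any text that
    # follows a removed node (with_tail=False).
    lxml.etree.strip_elements(
        root, lxml.etree.Comment, "script", "head", with_tail=False
    )
    text = lxml.html.tostring(root, method="text", encoding="unicode")
    # Collapse the indentation and newlines left over from the markup.
    return " ".join(text.split())


plain = html_to_text(SAMPLE_HTML)
print(plain)
```

Inside a spider, the same helper could presumably be called with `response.text` in place of the sample string.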
    