Scrapy: Extract commented (hidden) content

旧巷少年郎 2021-01-03 06:07

How can I extract content from inside HTML comments with Scrapy?

For instance, how can I extract "Yellow" in the following example:

2 Answers
  • 2021-01-03 06:24

    First, use the XPath below to get all the comments from the page.

    data = response.xpath('//comment()').extract()
    

    Next, use any keyword to identify the comments you are interested in.

    up_data = []
    for d in data:
        if 'key' in d:
            up_data.append(d)
    

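    Then strip the comment delimiters, wrap each match in an HTML template, and parse it with a new Selector:

    from scrapy.selector import Selector

    html_template = '<html><body>%s</body></html>'
    for up_d in up_data:
        # Strip the comment delimiters and wrap the remainder in a minimal page.
        up_d = html_template % up_d.replace('<!--', '').replace('-->', '')
        sel = Selector(text=up_d)
        tables = sel.xpath('//div[@class="table_outer_container"]')
        # Do what you want with the selected nodes.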
    
  • 2021-01-03 06:26

    You can use an XPath expression like //comment() to get the comment content, and then parse that content after having stripped the comment tags.

    Example scrapy shell session:

    paul@wheezy:~$ scrapy shell 
    ...
    In [1]: doc = """<div class="fruit">
       ...:     <div class="infos">
       ...:         <h2 class="Name">Banana</h2>
       ...:         <span class="edible">Edible: Yes</span>
       ...:     </div>
       ...:     <!--
       ...:     <p class="color">Yellow</p>
       ...:     -->
       ...: </div>"""
    
    In [2]: from scrapy.selector import Selector
    
    In [4]: selector = Selector(text=doc, type="html")
    
    In [5]: import re
    
    In [6]: regex = re.compile(r'<!--(.*)-->', re.DOTALL)
    
    In [7]: selector.xpath('//comment()').re(regex)
    Out[7]: [u'\n    <p class="color">Yellow</p>\n    ']
    
    In [8]: comment = selector.xpath('//comment()').re(regex)[0]
    
    In [9]: commentsel = Selector(text=comment, type="html")
    
    In [10]: commentsel.css('p.color')
    Out[10]: [<Selector xpath=u"descendant-or-self::p[@class and contains(concat(' ', normalize-space(@class), ' '), ' color ')]" data=u'<p class="color">Yellow</p>'>]
    
    In [11]: commentsel.css('p.color').extract()
    Out[11]: [u'<p class="color">Yellow</p>']
    
    In [12]: commentsel.css('p.color::text').extract()
    Out[12]: [u'Yellow']
    