Constructing a regular expression for url in start_urls list in scrapy framework python

迷失自我 2021-01-15 22:20

I am very new to Scrapy, and I have not used regular expressions before.

The following is my spider.py code

class ExampleSpider(BaseSp         


        
2 answers
  •  走了就别回头了
    2021-01-15 22:42

    If you are using CrawlSpider, it is usually not a good idea to override the parse method, because CrawlSpider uses parse itself to apply its rules.

    A Rule object can separate the URLs you are interested in from the ones you do not care about.

    See CrawlSpider in the docs for reference.

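    # Import paths for older Scrapy releases; in Scrapy 1.0+ use
    # scrapy.spiders (CrawlSpider, Rule) and scrapy.linkextractors (LinkExtractor).
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector

    class ExampleSpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/bookstore']

        rules = (
            # Extract links whose URL contains "/new/", one digit, and a literal "?",
            # and pass each response to parse_bookstore.
            Rule(SgmlLinkExtractor(allow=(r'/new/[0-9]\?',)), callback='parse_bookstore'),
        )

        # Defined inside the class, with a name that matches the callback string above.
        def parse_bookstore(self, response):
            hxs = HtmlXPathSelector(response)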
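
Before wiring a pattern into a Rule, it can help to try it on a few URLs with the standard re module. The sketch below assumes the link extractor applies the allow pattern as a regular-expression search against the full URL. The sample URLs, including the filter=bookstore query string, and the helper name matching_urls are made up for illustration.

```python
import re

# Same pattern as in the Rule above: "/new/", one digit, then a literal "?".
pattern = re.compile(r'/new/[0-9]\?')

# Hypothetical URLs, invented for this illustration only.
candidates = [
    'http://www.example.com/bookstore/new/1?filter=bookstore',
    'http://www.example.com/bookstore/new/12?filter=bookstore',
    'http://www.example.com/bookstore/old/1?filter=bookstore',
    'http://www.example.com/bookstore/new/1',
]

def matching_urls(urls):
    """Return the URLs the pattern accepts (search anywhere, not a full match)."""
    return [u for u in urls if pattern.search(u)]

for url in matching_urls(candidates):
    print(url)
```

Only the first URL is printed. The second one fails because [0-9] matches exactly one digit, so "12?" does not fit; if page numbers can have several digits, [0-9]+ would accept them.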
    
