Constructing a regular expression for URLs in the start_urls list in the Scrapy framework (Python)

迷失自我 2021-01-15 22:20

I am very new to Scrapy, and I haven't used regular expressions before.

The following is my spider.py code:

class ExampleSpider(BaseSpider):
    name = "test_code"
    allowed_domains = ["www.example.com"]
    start_urls = ["http://www.example.com/bookstore/new/1?filter=bookstore",
                  "http://www.example.com/bookstore/new/2?filter=bookstore",
                  ...]

2 Answers
  • 2021-01-15 22:32

    If I understand you correctly, you want a lot of start URLs with a certain pattern.

    If so, you can override the BaseSpider.start_requests method:

    class ExampleSpider(BaseSpider):
        name = "test_code"
        allowed_domains = ["www.example.com"]

        def start_requests(self):
            # Yield one request per numbered bookstore page
            # (xrange is Python 2; use range on Python 3).
            for i in xrange(1000):
                yield self.make_requests_from_url("http://www.example.com/bookstore/new/%d?filter=bookstore" % i)

        ...
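
    On modern Scrapy versions, where BaseSpider no longer exists and make_requests_from_url is deprecated, the same idea is to yield scrapy.Request objects directly. A minimal sketch, carrying over the 1000-page range from the loop above as an assumption:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "test_code"
        allowed_domains = ["www.example.com"]

        def start_requests(self):
            # Assumed page range; adjust to however many pages the site has.
            for i in range(1000):
                yield scrapy.Request(
                    "http://www.example.com/bookstore/new/%d?filter=bookstore" % i,
                    callback=self.parse,
                )

        def parse(self, response):
            # Item extraction would go here.
            pass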
    
  • 2021-01-15 22:42

    If you are using CrawlSpider, it's not usually a good idea to override the parse method.

    A Rule object can filter the URLs you are interested in from the ones you do not care about.

    See CrawlSpider in the docs for reference.

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector

    class ExampleSpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/bookstore']

        rules = (
            # Follow only links like /bookstore/new/12?filter=bookstore;
            # [0-9]+ also matches multi-digit page numbers.
            Rule(SgmlLinkExtractor(allow=(r'/new/[0-9]+\?',)), callback='parse_bookstore'),
        )

        def parse_bookstore(self, response):
            hxs = HtmlXPathSelector(response)
            # ... extract fields from the page with hxs.select() here
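
    On Scrapy 1.0 and later, the scrapy.contrib import paths above are deprecated and were removed in subsequent releases. A minimal sketch of the same spider with the current module layout, assuming the same URL pattern and a placeholder XPath:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class ExampleSpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/bookstore']

        rules = (
            # allow is a regex tested against each extracted link's URL.
            Rule(LinkExtractor(allow=(r'/new/[0-9]+\?',)), callback='parse_bookstore'),
        )

        def parse_bookstore(self, response):
            # Selectors now live on the response itself; the XPath below
            # is only an assumed placeholder for real extraction logic.
            for title in response.xpath('//title/text()').getall():
                yield {'title': title}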
    