Scrapy - Continuously fetch urls to crawl from database

日久生厌 2021-01-03 08:07

I'd like to continuously fetch URLs to crawl from a database. So far I have succeeded in fetching URLs from the database, but I'd like my spider to keep reading from that database since

1 answer
  • 2021-01-03 08:36

    Personally, I would recommend starting a new spider each time you have something to crawl. If you want to keep the process alive instead, I would recommend using the spider_idle signal:

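    from scrapy import Request, Spider, signals

    class MySpider(Spider):
        name = "my_spider"  # placeholder name

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            # spider_closed is assumed to be defined in the elided part of the class
            crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
            # call spider_idle whenever the spider has no requests left
            crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
            return spider
        ...
        def spider_idle(self, spider):
            # read the database again and send new requests;
            # new_url stands for a URL fetched from the database

            # requests cannot be yielded from a signal handler,
            # so they are handed to the engine directly
            # (newer Scrapy versions deprecate the spider argument; pass only the request there)
            self.crawler.engine.crawl(
                Request(new_url, callback=self.parse),
                spider,
            )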
    

    This way, the new requests are scheduled before the spider actually closes.
