How do I stop all spiders and the engine immediately after a condition in a pipeline is met?

一向 2020-12-02 21:28

We have a system written with Scrapy to crawl a few websites. There are several spiders, and a few cascaded pipelines for all items passed by all spiders.

1 Answer
  • 2020-12-02 21:57

    You can raise a CloseSpider exception to close down a spider. However, I don't think this will work from a pipeline.

    EDIT: avaleske notes in the comments to this answer that he was able to raise a CloseSpider exception from a pipeline, so the simplest approach is to do just that.
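
    As a minimal sketch, assuming that raising CloseSpider from process_item works as avaleske reports (the class name ItemLimitPipeline and the item threshold are illustrative, not part of Scrapy; only CloseSpider itself is a real Scrapy exception):

    from scrapy.exceptions import CloseSpider

    class ItemLimitPipeline(object):
        """Hypothetical pipeline that closes the spider after too many items."""

        def __init__(self):
            self.item_count = 0
            self.max_items = 100  # illustrative threshold

        def process_item(self, item, spider):
            self.item_count += 1
            if self.item_count > self.max_items:
                # reported to work from a pipeline, per the comments
                raise CloseSpider(reason='item limit reached')
            return item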

    A similar situation has been described on the Scrapy Users group, in this thread.

    I quote:

    To close a spider from any part of your code, you should use the engine.close_spider method. See this extension for a usage example: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/closespider.py#L61

    You could write your own extension, using closespider.py as an example, that shuts down a spider when a certain condition is met.
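
    A minimal sketch of such an extension might look like the following. The class name, the MAX_SCRAPED_ITEMS setting, and the item-count condition are all assumptions for illustration; signals.item_scraped, crawler.signals.connect, and engine.close_spider are real Scrapy APIs:

    from scrapy import signals

    class ShutdownOnConditionExtension(object):
        """Hypothetical extension that closes the spider once a condition holds."""

        def __init__(self, crawler, max_items):
            self.crawler = crawler
            self.max_items = max_items
            self.items_scraped = 0
            # get notified every time an item makes it through the pipelines
            crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)

        @classmethod
        def from_crawler(cls, crawler):
            # hypothetical setting name used as the shutdown condition
            max_items = crawler.settings.getint('MAX_SCRAPED_ITEMS', 1000)
            return cls(crawler, max_items)

        def item_scraped(self, item, spider):
            self.items_scraped += 1
            if self.items_scraped >= self.max_items:
                # engine.close_spider shuts the spider down gracefully
                self.crawler.engine.close_spider(spider, 'condition_met')

    You would then enable it in settings.py via the EXTENSIONS dict, the same way the built-in CloseSpider extension is enabled.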

    Another "hack" would be to set a flag on the spider in the pipeline. For example:

    pipeline:

    def process_item(self, item, spider):
        if some_flag:  # whatever condition should trigger the shutdown
            spider.close_down = True
        return item    # always return the item so later pipelines still run


    spider:

    from scrapy.exceptions import CloseSpider

    def parse(self, response):
        # the flag may never have been set, so default to False
        if getattr(self, 'close_down', False):
            raise CloseSpider(reason='API usage exceeded')

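    Note that neither CloseSpider nor engine.close_spider kills the engine instantly: requests already scheduled or in flight are still processed while the spider shuts down, so a few extra items may pass through your pipelines before everything stops.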