Locally run all of the spiders in Scrapy

Asked by 忘掉有多难, 2020-12-03 03:27

Is there a way to run all of the spiders in a Scrapy project without using the Scrapy daemon? There used to be a way to run multiple spiders with scrapy crawl, but that syntax was removed.

4 Answers
  • Here is an example that does not run inside a custom command, but runs the Reactor manually and creates a new Crawler for each spider:

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    # the scrapy.conf.settings singleton was deprecated last year
    from scrapy.utils.project import get_project_settings
    from scrapy import log

    def setup_crawler(spider_name):
        # one Crawler per spider, all sharing the project settings
        crawler = Crawler(settings)
        crawler.configure()
        spider = crawler.spiders.create(spider_name)
        crawler.crawl(spider)
        crawler.start()

    log.start()
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()

    # schedule every spider registered in the project, then run the reactor
    for spider_name in crawler.spiders.list():
        setup_crawler(spider_name)

    reactor.run()
    

    You will have to design some signal system to stop the reactor when all spiders are finished.
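
    One way to do that (a rough sketch against the same old Crawler API, not part of the original answer) is to connect each crawler's spider_closed signal to a small counter and stop the reactor once the last spider has finished:

    from scrapy import signals

    running = []

    def spider_closed(spider):
        # hypothetical helper: stop the reactor when the last spider closes
        running.remove(spider.name)
        if not running:
            reactor.stop()

    def setup_crawler(spider_name):
        crawler = Crawler(settings)
        crawler.configure()
        # call spider_closed() whenever this crawler's spider finishes
        crawler.signals.connect(spider_closed, signal=signals.spider_closed)
        spider = crawler.spiders.create(spider_name)
        running.append(spider_name)
        crawler.crawl(spider)
        crawler.start()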

    EDIT: And here is how you can run multiple spiders in a custom command:

    from scrapy.command import ScrapyCommand
    from scrapy.utils.project import get_project_settings
    from scrapy.crawler import Crawler
    
    class Command(ScrapyCommand):
    
        requires_project = True
    
        def syntax(self):
            return '[options]'
    
        def short_desc(self):
            return 'Runs all of the spiders'
    
        def run(self, args, opts):
            settings = get_project_settings()
    
            # create and start a separate Crawler for every spider in the project
            for spider_name in self.crawler.spiders.list():
                crawler = Crawler(settings)
                crawler.configure()
                spider = crawler.spiders.create(spider_name)
                crawler.crawl(spider)
                crawler.start()
    
            self.crawler.start()
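
    To make Scrapy pick up such a custom command, it has to live in a commands package that the project settings point to. A minimal setup (the file name crawlall.py and the module path myproject.commands are assumptions, use whatever fits your project):

    # settings.py
    # assumes the command above is saved as myproject/commands/crawlall.py
    # and that myproject/commands/__init__.py exists (it can be empty)
    COMMANDS_MODULE = 'myproject.commands'

    The command is then available as scrapy crawlall (the command name comes from the file name).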
    
  • 2020-12-03 04:01

    This code works with my Scrapy version, 1.3.3 (save it in the same directory as scrapy.cfg):

    from scrapy.utils.project import get_project_settings
    from scrapy.crawler import CrawlerProcess
    
    setting = get_project_settings()
    process = CrawlerProcess(setting)
    
    for spider_name in process.spiders.list():
        print ("Running spider %s" % (spider_name))
        process.crawl(spider_name,query="dvh") #query dvh is custom argument used in your scrapy
    
    process.start()
    

    For Scrapy 1.5.x (so you don't get the deprecation warning):

    from scrapy.utils.project import get_project_settings
    from scrapy.crawler import CrawlerProcess
    
    setting = get_project_settings()
    process = CrawlerProcess(setting)
    
    for spider_name in process.spider_loader.list():
        print ("Running spider %s" % (spider_name))
        process.crawl(spider_name,query="dvh") #query dvh is custom argument used in your scrapy
    
    process.start()
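
    The query="dvh" keyword is handed to the spider when it is created. As a sketch of how a spider might receive it (the spider name and URL below are made up, not from the answer):

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"

        def __init__(self, query=None, *args, **kwargs):
            super(ExampleSpider, self).__init__(*args, **kwargs)
            # keyword arguments from process.crawl(..., query="dvh")
            # (or scrapy crawl -a query=dvh) arrive here
            self.query = query

        def start_requests(self):
            yield scrapy.Request("https://example.com/search?q=%s" % self.query)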
    
  • 2020-12-03 04:08

    The answer from @Steven Almeroth will fail in Scrapy 1.0; you should edit the script like this:

    from scrapy.commands import ScrapyCommand
    from scrapy.utils.project import get_project_settings
    from scrapy.crawler import CrawlerProcess
    
    class Command(ScrapyCommand):
    
        requires_project = True
        excludes = ['spider1']  # spiders listed here are skipped
    
        def syntax(self):
            return '[options]'
    
        def short_desc(self):
            return 'Runs all of the spiders'
    
        def run(self, args, opts):
            settings = get_project_settings()
            crawler_process = CrawlerProcess(settings) 
    
            for spider_name in crawler_process.spider_loader.list():
                if spider_name in self.excludes:
                    continue
                spider_cls = crawler_process.spider_loader.load(spider_name) 
                crawler_process.crawl(spider_cls)
            crawler_process.start()
    
  • 2020-12-03 04:12

    Why didn't you just use something like:

    scrapy list|xargs -n 1 scrapy crawl
    

    ?
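
    That runs the spiders one after another, each in its own process. A rough Python equivalent of the same idea, in case you want to stay inside a script (just a sketch; it assumes the scrapy executable is on your PATH and Python 3.7+ for capture_output):

    import subprocess

    # equivalent of `scrapy list`: one spider name per line
    names = subprocess.run(
        ["scrapy", "list"], capture_output=True, text=True, check=True
    ).stdout.split()

    # equivalent of `xargs -n 1 scrapy crawl`: one process per spider name
    for name in names:
        subprocess.run(["scrapy", "crawl", name], check=True)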
