Locally run all of the spiders in Scrapy

Asked by 忘掉有多难, 2020-12-03 03:27

Is there a way to run all of the spiders in a Scrapy project without using the Scrapy daemon? There used to be a way to run multiple spiders with scrapy crawl, but that syntax was removed.

4 Answers
  • Here is an example that does not run inside a custom command, but runs the Reactor manually and creates a new Crawler for each spider:

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    # the scrapy.conf.settings singleton was deprecated last year
    from scrapy.utils.project import get_project_settings
    from scrapy import log

    def setup_crawler(spider_name):
        # one Crawler per spider, all sharing the project settings
        crawler = Crawler(settings)
        crawler.configure()
        spider = crawler.spiders.create(spider_name)
        crawler.crawl(spider)
        crawler.start()

    log.start()
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()

    # schedule every spider registered in the project, then run the reactor
    for spider_name in crawler.spiders.list():
        setup_crawler(spider_name)

    reactor.run()
    

    You will have to design some signal system to stop the reactor when all spiders are finished.
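
    One way to do that (a rough sketch against the same old Crawler API, not part of the original answer) is to connect each crawler's spider_closed signal to a small counter and stop the reactor once the last spider has finished:

    from scrapy import signals

    running = []

    def spider_closed(spider):
        # hypothetical helper: stop the reactor when the last spider closes
        running.remove(spider.name)
        if not running:
            reactor.stop()

    def setup_crawler(spider_name):
        crawler = Crawler(settings)
        crawler.configure()
        # call spider_closed() whenever this crawler's spider finishes
        crawler.signals.connect(spider_closed, signal=signals.spider_closed)
        spider = crawler.spiders.create(spider_name)
        running.append(spider_name)
        crawler.crawl(spider)
        crawler.start()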

    EDIT: And here is how you can run multiple spiders in a custom command:

    from scrapy.command import ScrapyCommand
    from scrapy.utils.project import get_project_settings
    from scrapy.crawler import Crawler
    
    class Command(ScrapyCommand):
    
        requires_project = True
    
        def syntax(self):
            return '[options]'
    
        def short_desc(self):
            return 'Runs all of the spiders'
    
        def run(self, args, opts):
            settings = get_project_settings()
    
            # create and start a separate Crawler for every spider in the project
            for spider_name in self.crawler.spiders.list():
                crawler = Crawler(settings)
                crawler.configure()
                spider = crawler.spiders.create(spider_name)
                crawler.crawl(spider)
                crawler.start()
    
            self.crawler.start()
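
    To make Scrapy pick up such a custom command, it has to live in a commands package that the project settings point to. A minimal setup (the file name crawlall.py and the module path myproject.commands are assumptions, use whatever fits your project):

    # settings.py
    # assumes the command above is saved as myproject/commands/crawlall.py
    # and that myproject/commands/__init__.py exists (it can be empty)
    COMMANDS_MODULE = 'myproject.commands'

    The command is then available as scrapy crawlall (the command name comes from the file name).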
    
  • 2020-12-03 04:01

    This code works with my Scrapy version, 1.3.3 (save it in the same directory as scrapy.cfg):

    from scrapy.utils.project import get_project_settings
    from scrapy.crawler import CrawlerProcess
    
    setting = get_project_settings()
    process = CrawlerProcess(setting)
    
    for spider_name in process.spiders.list():
        print ("Running spider %s" % (spider_name))
        process.crawl(spider_name,query="dvh") #query dvh is custom argument used in your scrapy
    
    process.start()
    

    For Scrapy 1.5.x (so you don't get the deprecation warning):

    from scrapy.utils.project import get_project_settings
    from scrapy.crawler import CrawlerProcess
    
    setting = get_project_settings()
    process = CrawlerProcess(setting)
    
    for spider_name in process.spider_loader.list():
        print ("Running spider %s" % (spider_name))
        process.crawl(spider_name,query="dvh") #query dvh is custom argument used in your scrapy
    
    process.start()
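
    The query="dvh" keyword is handed to the spider when it is created. As a sketch of how a spider might receive it (the spider name and URL below are made up, not from the answer):

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"

        def __init__(self, query=None, *args, **kwargs):
            super(ExampleSpider, self).__init__(*args, **kwargs)
            # keyword arguments from process.crawl(..., query="dvh")
            # (or scrapy crawl -a query=dvh) arrive here
            self.query = query

        def start_requests(self):
            yield scrapy.Request("https://example.com/search?q=%s" % self.query)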
    
  • 2020-12-03 04:08

    The answer from @Steven Almeroth will fail in Scrapy 1.0; you should edit the script like this:

    from scrapy.commands import ScrapyCommand
    from scrapy.utils.project import get_project_settings
    from scrapy.crawler import CrawlerProcess
    
    class Command(ScrapyCommand):
    
        requires_project = True
        excludes = ['spider1']  # spiders listed here are skipped
    
        def syntax(self):
            return '[options]'
    
        def short_desc(self):
            return 'Runs all of the spiders'
    
        def run(self, args, opts):
            settings = get_project_settings()
            crawler_process = CrawlerProcess(settings) 
    
            for spider_name in crawler_process.spider_loader.list():
                if spider_name in self.excludes:
                    continue
                spider_cls = crawler_process.spider_loader.load(spider_name) 
                crawler_process.crawl(spider_cls)
            crawler_process.start()
    
  • 2020-12-03 04:12

    Why didn't you just use something like:

    scrapy list|xargs -n 1 scrapy crawl
    

    ?
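
    That runs the spiders one after another, each in its own process. A rough Python equivalent of the same idea, in case you want to stay inside a script (just a sketch; it assumes the scrapy executable is on your PATH and Python 3.7+ for capture_output):

    import subprocess

    # equivalent of `scrapy list`: one spider name per line
    names = subprocess.run(
        ["scrapy", "list"], capture_output=True, text=True, check=True
    ).stdout.split()

    # equivalent of `xargs -n 1 scrapy crawl`: one process per spider name
    for name in names:
        subprocess.run(["scrapy", "crawl", name], check=True)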
