Question
I'm using scrapy for a project where I want to scrape a number of sites - possibly hundreds - and I have to write a specific spider for each site. I can schedule one spider in a project deployed to scrapyd using:
curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2
But how do I schedule all spiders in a project at once?
All help much appreciated!
Answer 1:
My solution for running 200+ spiders at once has been to create a custom command for the project. See http://doc.scrapy.org/en/latest/topics/commands.html#custom-project-commands for more information about implementing custom commands.
YOURPROJECTNAME/commands/allcrawl.py :
from scrapy.command import ScrapyCommand
import urllib
import urllib2
from scrapy import log


class AllCrawlCommand(ScrapyCommand):

    requires_project = True
    default_settings = {'LOG_ENABLED': False}

    def short_desc(self):
        return "Schedule a run for all available spiders"

    def run(self, args, opts):
        url = 'http://localhost:6800/schedule.json'
        # POST one schedule.json request per spider known to the project
        for s in self.crawler.spiders.list():
            values = {'project': 'YOUR_PROJECT_NAME', 'spider': s}
            data = urllib.urlencode(values)
            req = urllib2.Request(url, data)
            response = urllib2.urlopen(req)
            log.msg(response)
Make sure the commands directory is a Python package (give it an empty __init__.py) and include the following in your settings.py:
COMMANDS_MODULE = 'YOURPROJECTNAME.commands'
Then from the command line (in your project directory) you can simply type
scrapy allcrawl
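Note that the command above uses the old Python 2 era API (scrapy.command, scrapy.log, urllib2). A rough Python 3 sketch of the same idea follows; it assumes the newer module paths (scrapy.commands, self.crawler_process.spider_loader) and the default scrapyd address, so treat it as a starting point rather than a verified drop-in.

# YOURPROJECTNAME/commands/allcrawl.py -- Python 3 sketch, untested
import json
import urllib.parse
import urllib.request

from scrapy.commands import ScrapyCommand


class AllCrawlCommand(ScrapyCommand):

    requires_project = True
    default_settings = {'LOG_ENABLED': False}

    def short_desc(self):
        return "Schedule a run for all available spiders"

    def run(self, args, opts):
        url = 'http://localhost:6800/schedule.json'  # default scrapyd endpoint
        # spider_loader replaces the old self.crawler.spiders attribute
        for spider_name in self.crawler_process.spider_loader.list():
            data = urllib.parse.urlencode(
                {'project': 'YOUR_PROJECT_NAME', 'spider': spider_name}
            ).encode()
            with urllib.request.urlopen(url, data=data) as response:
                print(json.loads(response.read()))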
Answer 2:
Sorry, I know this is an old topic, but I started learning Scrapy recently and stumbled on this question; I don't have enough rep yet to post a comment, so I'm posting an answer.
Scrapy's common practices documentation shows that if you need to run multiple spiders at once, you'll have to start multiple scrapyd service instances and then distribute your spider runs among them.
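To make that concrete, here is a small sketch of one way to distribute runs: query one instance's listspiders.json endpoint for the project's spiders, then round-robin schedule.json requests across the available scrapyd instances. The host list and project name below are placeholders, not part of the original answer.

# distribute_spiders.py -- illustrative sketch; hosts and project name are placeholders
import itertools
import json
import urllib.parse
import urllib.request

SCRAPYD_HOSTS = ['http://localhost:6800', 'http://localhost:6801']
PROJECT = 'myproject'

# Ask the first instance which spiders the project contains.
with urllib.request.urlopen(
        SCRAPYD_HOSTS[0] + '/listspiders.json?project=' + PROJECT) as resp:
    spiders = json.loads(resp.read())['spiders']

# Round-robin the schedule requests across the scrapyd instances.
for spider, host in zip(spiders, itertools.cycle(SCRAPYD_HOSTS)):
    data = urllib.parse.urlencode({'project': PROJECT, 'spider': spider}).encode()
    with urllib.request.urlopen(host + '/schedule.json', data=data) as resp:
        print(spider, '->', host, json.loads(resp.read()))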
Source: https://stackoverflow.com/questions/10801093/run-multiple-scrapy-spiders-at-once-using-scrapyd