Question
I have two spiders that take URLs and data scraped by a main spider. My approach was to use CrawlerProcess in the main spider and pass the data to the two spiders. Here's my approach:
import scrapy
from scrapy.crawler import CrawlerProcess

class LightnovelSpider(scrapy.Spider):
    name = "novelDetail"
    allowed_domains = ["readlightnovel.com"]

    def __init__(self, novels=None, *args, **kwargs):
        # call the base constructor and avoid a mutable default argument
        super().__init__(*args, **kwargs)
        self.novels = novels or []

    def start_requests(self):
        for novel in self.novels:
            self.logger.info(novel)
            yield scrapy.Request(novel, callback=self.parseNovel)

    def parseNovel(self, response):
        # stuff here
class chapterSpider(scrapy.Spider):
    name = "chapters"
    # not done here
class initCrawler(scrapy.Spider):
    name = "main"

    fromMongo = {}
    toChapter = {}
    toNovel = []
    fromScraper = []

    def start_requests(self):
        urls = ['http://www.readlightnovel.com/novel-list']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # the predicate has to sit on the <a> element, not on @href
        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a[not(@href="#")]/@href').extract():
            initCrawler.fromScraper.append(novel)
        self.checkchanged()

    def checkchanged(self):
        # some scraped data processing here
        self.dispatchSpiders()

    def dispatchSpiders(self):
        process = CrawlerProcess()
        # crawl() expects a spider class, not an instance
        process.crawl(LightnovelSpider, novels=initCrawler.toNovel)
        process.start()
        self.logger.info("Main Spider Finished")
I run "scrapy crawl main" and get a beautiful error
The main error i can see is a "twisted.internet.error.ReactorAlreadyRunning" . Which i have no idea about. Are there better approaches running multiple spiders from another and/or how can i stop this error?
Answer 1:
Wow, I didn't know something like this could work, but I never tried it.
What I do instead, when multiple scraping stages have to work hand in hand, is one of these two options:
Option 1 - Use a database
When the scrapers have to run in continuous mode, rescanning sites etc., I would just make the scrapers push their results into a database (through a pipeline), and the spiders that do the subsequent processing would pull the data they need from the same database (in your case, the novel URLs, for example).
Then keep everything running using a scheduler or cron, and the spiders will work hand in hand.
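As a minimal sketch of such a pipeline, assuming MongoDB via pymongo (the pipeline class name, database, collection, and item field names are illustrative, not from the question):

import pymongo

class NovelUrlPipeline:
    """Sketch only: push each scraped novel URL into MongoDB."""

    def open_spider(self, spider):
        # connection details would normally live in settings.py
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.collection = self.client["lightnovel"]["novel_urls"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # upsert so repeated rescans don't create duplicate records
        self.collection.update_one(
            {"url": item["url"]}, {"$set": dict(item)}, upsert=True
        )
        return item

Enable it through the ITEM_PIPELINES setting as usual; the subsequent spiders then read their start URLs from the same collection.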
Option 2 - Merging everything into one spider
That's the way I choose when everything needs to run as one piece of script: I create one spider that chains multiple request steps together.
import scrapy

class LightnovelSpider(scrapy.Spider):
    name = "novels"
    allowed_domains = ["readlightnovel.com"]

    # was initCrawler.start_requests
    def start_requests(self):
        urls = ['http://www.readlightnovel.com/novel-list']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_novel_list)

    # a mix of initCrawler.parse and parts of LightnovelSpider.start_requests
    def parse_novel_list(self, response):
        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a[not(@href="#")]/@href').extract():
            yield scrapy.Request(novel, callback=self.parse_novel)

    def parse_novel(self, response):
        # stuff here
        # ... and create requests with callback=self.parse_chapters

    def parse_chapters(self, response):
        # do stuff
(code is not tested, it's just to show the basic idea)
If things get too complex, I pull some elements out and move them into mixin classes, for example as sketched below.
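A rough illustration of that idea (ChapterParsingMixin, NovelSpider, and the method body are purely illustrative, not from the question):

import scrapy

class ChapterParsingMixin:
    # shared parsing logic that several spiders can reuse
    def parse_chapters(self, response):
        pass  # chapter-parsing code would go here

class NovelSpider(ChapterParsingMixin, scrapy.Spider):
    name = "novels"
    # ... the rest of the merged spider from above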
In your case I would most probably prefer option 2.
Answer 2:
After some research, I was able to solve this problem by using the @property decorator to expose data from the main spider, like this:
class initCrawler(scrapy.Spider):
    # stuff here from question

    @property
    def getNovel(self):
        return self.toNovel

    @property
    def getChapter(self):
        return self.toChapter
Then I used CrawlerRunner, which schedules crawls on an existing reactor instead of starting its own, like this:
from spiders.lightnovel import chapterSpider, lightnovelSpider, initCrawler
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor, defer

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(initCrawler)
    toNovel = initCrawler.toNovel
    toChapter = initCrawler.toChapter
    yield runner.crawl(chapterSpider, chapters=toChapter)
    yield runner.crawl(lightnovelSpider, novels=toNovel)
    reactor.stop()

crawl()
reactor.run()
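This works because CrawlerRunner, unlike CrawlerProcess, never starts the reactor itself, and @defer.inlineCallbacks makes each yield wait for the previous crawl to finish. The three spiders therefore run one after another on a single reactor that is started and stopped exactly once, which is what avoids twisted.internet.error.ReactorAlreadyRunning.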
Source: https://stackoverflow.com/questions/43425666/scrapy-run-multiple-spiders-from-a-main-spider