Scrapy: run multiple spiders from a main spider?


Question


I have two spiders that consume URLs and data scraped by a main spider. My approach was to use CrawlerProcess in the main spider and pass the data to the two sub-spiders. Here's my approach:

import scrapy
from scrapy.crawler import CrawlerProcess


class LightnovelSpider(scrapy.Spider):

    name = "novelDetail"
    allowed_domains = ["readlightnovel.com"]

    def __init__(self, novels=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.novels = novels or []

    def start_requests(self):
        for novel in self.novels:
            self.logger.info(novel)
            request = scrapy.Request(novel, callback=self.parseNovel)
            yield request

    def parseNovel(self, response):
        # stuff here
        pass

class chapterSpider(scrapy.Spider):
    name = "chapters"
    # not done here

class initCrawler(scrapy.Spider):
    name = "main"
    fromMongo = {}
    toChapter = {}
    toNovel = []
    fromScraper = []

    def start_requests(self):
        urls = ['http://www.readlightnovel.com/novel-list']

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a/@href[not(@href="#")]').extract():
            initCrawler.fromScraper.append(novel)

        self.checkchanged()

    def checkchanged(self):
        # some scraped data processing here
        self.dispatchSpiders()

    def dispatchSpiders(self):
        process = CrawlerProcess()
        novelSpider = LightnovelSpider()
        process.crawl(novelSpider, novels=initCrawler.toNovel)
        process.start()
        self.logger.info("Main Spider Finished")

I run "scrapy crawl main" and get a beautiful error

The main error i can see is a "twisted.internet.error.ReactorAlreadyRunning" . Which i have no idea about. Are there better approaches running multiple spiders from another and/or how can i stop this error?


Answer 1:


Wow, didn't know something like this could work, but I never tried.

What I do instead, when multiple scraping stages have to work hand in hand, is one of these two options:

Option 1 - Use a database

When the scrapers have to run in continuous mode, rescanning sites etc., I would just make the scrapers push their results into a database (through a pipeline).

The spiders that do the subsequent processing would then pull the data they need from the same database (in your case the novel URLs, for example).

Then keep everything running with a scheduler or cron, and the spiders will work hand in hand.
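
For example, here's a minimal sketch of such a setup backed by MongoDB (the database/collection names, the "url" item field, and the consuming spider below are assumptions for illustration, not from the original code):

import pymongo
import scrapy

class NovelUrlPipeline:
    """Hypothetical item pipeline that pushes scraped novel URLs into MongoDB."""

    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.collection = self.client["lightnovel"]["novel_urls"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Upsert by URL so continuous rescans don't create duplicates.
        self.collection.update_one(
            {"url": item["url"]}, {"$set": dict(item)}, upsert=True
        )
        return item

class NovelDetailSpider(scrapy.Spider):
    """Hypothetical follow-up spider that pulls its start URLs from the same database."""
    name = "novelDetail"

    def start_requests(self):
        collection = pymongo.MongoClient("mongodb://localhost:27017")["lightnovel"]["novel_urls"]
        for doc in collection.find({}, {"url": 1}):
            yield scrapy.Request(doc["url"], callback=self.parse_novel)

    def parse_novel(self, response):
        pass  # novel detail parsing goes here

The pipeline would be enabled through the ITEM_PIPELINES setting, and the two spiders can then be scheduled independently.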

Option 2 - Merging everything into one spider

That's the way I choose when everything needs to run as one script: I create one spider that chains multiple request steps together.

import scrapy
from scrapy import Request


class LightnovelSpider(scrapy.Spider):

    name = "novels"
    allowed_domains = ["readlightnovel.com"]

    # was initCrawler.start_requests
    def start_requests(self):
        urls = ['http://www.readlightnovel.com/novel-list']

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_novel_list)

    # a mix of initCrawler.parse and parts of LightnovelSpider.start_requests
    def parse_novel_list(self, response):
        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a/@href[not(@href="#")]').extract():
            yield Request(novel, callback=self.parse_novel)

    def parse_novel(self, response):
        # stuff here
        # ... and create requests with callback=self.parse_chapters
        pass

    def parse_chapters(self, response):
        # do stuff
        pass

(code is not tested, it's just to show the basic idea)
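
One detail worth noting when chaining steps like this: data scraped at one step can be handed to the next callback, e.g. via cb_kwargs (available since Scrapy 1.7). A minimal sketch, where the "novel_url" argument is a hypothetical example:

import scrapy

class LightnovelSpider(scrapy.Spider):
    name = "novels"

    def parse_novel_list(self, response):
        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a/@href[not(@href="#")]').extract():
            # hand the listing URL to the next callback
            yield scrapy.Request(novel, callback=self.parse_novel,
                                 cb_kwargs={"novel_url": novel})

    def parse_novel(self, response, novel_url):
        # novel_url arrives as a keyword argument from cb_kwargs above
        self.logger.info("parsing novel from %s", novel_url)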

If things get too complex I pull out some elements and move them into mixin classes.
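
For illustration, a minimal sketch of that mixin idea (the class split below is hypothetical):

import scrapy

class ChapterParsingMixin:
    """Holds the chapter-related callbacks, shared between spiders."""

    def parse_chapters(self, response):
        pass  # shared chapter-parsing logic goes here

class LightnovelSpider(ChapterParsingMixin, scrapy.Spider):
    name = "novels"
    # the novel-list and novel-detail callbacks stay here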

In your case I would most probably prefer option 2.




Answer 2:


After some research I was able to solve this problem by using the "@property" decorator to retrieve data from the main spider, like this:

class initCrawler(scrapy.Spider):

    #stuff here from question

    @property
    def getNovel(self):
        return self.toNovel

    @property
    def getChapter(self):
        return self.toChapter

Then I used CrawlerRunner like this:

from spiders.lightnovel import chapterSpider, lightnovelSpider, initCrawler
from scrapy.crawler import CrawlerProcess, CrawlerRunner
from twisted.internet import reactor, defer
from scrapy.utils.log import configure_logging
import logging

configure_logging()

runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(initCrawler)
    toNovel = initCrawler.toNovel
    toChapter = initCrawler.toChapter
    yield runner.crawl(chapterSpider,chapters=toChapter)
    yield runner.crawl(lightnovelSpider,novels=toNovel)

    reactor.stop()

crawl()
reactor.run()
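
This works because CrawlerRunner, unlike CrawlerProcess, leaves reactor management to the caller: the crawls are chained with @defer.inlineCallbacks on a single reactor that is started once with reactor.run() and stopped once at the end, which is what avoids the ReactorAlreadyRunning error from the question.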


Source: https://stackoverflow.com/questions/43425666/scrapy-run-multiple-spiders-from-a-main-spider
