Scrapy: put two spiders in a single file

Asked by 滥情空心 on 2021-01-14 05:51

I have written two spiders in a single file. When I ran scrapy runspider two_spiders.py, only the first spider was executed. How can I run both of them without splitting the file?

2 Answers
  • 2021-01-14 05:53

    Let's read the documentation:

    Running multiple spiders in the same process

    By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.

    Here is an example that runs multiple spiders simultaneously:

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start() # the script will block here until all crawling jobs are finished
    

    (there are a few more examples in the documentation)
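
    One of those other variants uses CrawlerRunner, which leaves control of the Twisted reactor to you. A sketch adapted from the same documentation page, reusing the MySpider1/MySpider2 stubs above:

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    configure_logging()   # CrawlerRunner does not set up logging for you
    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()     # Deferred that fires when both crawls finish
    d.addBoth(lambda _: reactor.stop())
    reactor.run()         # blocks until the reactor is stopped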

    From your question it is not clear how you have put two spiders into one file; simply concatenating the contents of two single-spider files is not enough.

    Try what is written in the documentation, or at least show us your code; without it we can't help you.

  • 2021-01-14 06:18

    Here is a full Scrapy project with 2 spiders in one file.

    # quote_spiders.py
    import json
    import string
    
    import scrapy
    from scrapy.crawler import CrawlerProcess
    from scrapy.item import Item, Field
    
    
    class TextCleaningPipeline(object):
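        # Normalizes quote text: strips the site's curly quotes,
        # removes all punctuation and lowercases the result.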
    
        def _clean_text(self, text):
            text = text.replace('“', '').replace('”', '')
            table = str.maketrans({key: None for key in string.punctuation})
            clean_text = text.translate(table)
            return clean_text.lower()
    
        def process_item(self, item, spider):
            item['text'] = self._clean_text(item['text'])
            return item
    
    
    class JsonWriterPipeline(object):
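        # Appends each item as one JSON line to the file named by the
        # JSON_FILE setting (JSON Lines format, one object per line).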
    
        def open_spider(self, spider):
            self.file = open(spider.settings['JSON_FILE'], 'a')
    
        def close_spider(self, spider):
            self.file.close()
    
        def process_item(self, item, spider):
            line = json.dumps(dict(item)) + "\n"
            self.file.write(line)
            return item
    
    
    class QuoteItem(Item):
        text = Field()
        author = Field()
        tags = Field()
        spider = Field()
    
    
    class QuotesSpiderOne(scrapy.Spider):
        name = "quotes1"
    
        def start_requests(self):
            urls = ['http://quotes.toscrape.com/page/1/', ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)
    
        def parse(self, response):
            for quote in response.css('div.quote'):
                item = QuoteItem()
                item['text'] = quote.css('span.text::text').get()
                item['author'] = quote.css('small.author::text').get()
                item['tags'] = quote.css('div.tags a.tag::text').getall()
                item['spider'] = self.name
                yield item
    
    
    class QuotesSpiderTwo(scrapy.Spider):
        name = "quotes2"
    
        def start_requests(self):
            urls = ['http://quotes.toscrape.com/page/2/', ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)
    
        def parse(self, response):
            for quote in response.css('div.quote'):
                item = QuoteItem()
                item['text'] = quote.css('span.text::text').get()
                item['author'] = quote.css('small.author::text').get()
                item['tags'] = quote.css('div.tags a.tag::text').getall()
                item['spider'] = self.name
                yield item
    
    
    if __name__ == '__main__':
        settings = dict()
        settings['USER_AGENT'] = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
        settings['HTTPCACHE_ENABLED'] = True
        settings['JSON_FILE'] = 'items.jl'
        settings['ITEM_PIPELINES'] = dict()
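        # The pipeline classes are referenced through __main__ because they
        # are defined in this same file; lower numbers run earlier.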
        settings['ITEM_PIPELINES']['__main__.TextCleaningPipeline'] = 800
        settings['ITEM_PIPELINES']['__main__.JsonWriterPipeline'] = 801
    
        process = CrawlerProcess(settings=settings)
        process.crawl(QuotesSpiderOne)
        process.crawl(QuotesSpiderTwo)
        process.start()
    
    

    Install Scrapy and run the script:

    $ pip install Scrapy
    $ python quote_spiders.py 
    

    No other file is needed.
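
    If you want to check what the pipelines wrote, the output is a JSON Lines file and is easy to read back. A minimal sketch, assuming the JSON_FILE setting from the script above ('items.jl'):

    import json

    with open('items.jl') as f:
        for line in f:
            item = json.loads(line)   # one item per line
            print(item['spider'], '-', item['author'], '-', item['text'][:60])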

    This example, combined with the graphical debugger in PyCharm or VS Code, can help you understand the Scrapy workflow and make debugging easier.
