How to specify different process settings for two different spiders in Scrapy's CrawlerProcess?

Submitted by 房东的猫 on 2021-01-28 16:42:05

Question


I have two spiders that I want to execute in parallel. I used a CrawlerProcess instance and its crawl method to achieve this. However, I want to specify a different output file, i.e. FEED_URI, for each spider in the same process. I tried to loop over the spiders and run them as shown below. Although two different output files are generated, the process terminates as soon as the second spider completes execution. If the first spider finishes crawling before the second one, I get the desired output. However, if the second spider finishes first, the process doesn't wait for the first spider to complete. How can I fix this?

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spider_loader.list():
    setting['FEED_FORMAT'] = 'json'
    setting['LOG_LEVEL'] = 'INFO'
    setting['FEED_URI'] = spider_name+'.json'
    setting['LOG_FILE'] = spider_name+'.log'
    process = CrawlerProcess(setting)
    print("Running spider %s" % spider_name)
    process.crawl(spider_name)

process.start()
print("Completed")

Answer 1:


According to the Scrapy docs, running multiple spiders in a single CrawlerProcess should look like this:

import scrapy
from scrapy.crawler import CrawlerProcess

class Spider1(scrapy.Spider):
    ...

class Spider2(scrapy.Spider):
    ...

process = CrawlerProcess()
process.crawl(Spider1)
process.crawl(Spider2)
process.start()

Settings can be applied on a per-spider basis using the custom_settings spider attribute.
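Applied to the question, a minimal sketch could look like the following (the spider names, start URLs, and output file names are placeholders): each spider declares its own FEED_URI in custom_settings, so a single CrawlerProcess can run both and will not stop until both have finished.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class SpiderOne(scrapy.Spider):
    name = 'spider_one'                    # placeholder name
    start_urls = ['https://example.com']   # placeholder URL

    # per-spider settings: merged into the project settings
    # when the crawler for this spider is created
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': 'spider_one.json',
    }

    def parse(self, response):
        yield {'url': response.url}

class SpiderTwo(scrapy.Spider):
    name = 'spider_two'
    start_urls = ['https://example.org']
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': 'spider_two.json',
    }

    def parse(self, response):
        yield {'url': response.url}

# one CrawlerProcess is created once; both crawls are scheduled on it
# and start() blocks until every scheduled spider has finished
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderOne)
process.crawl(SpiderTwo)
process.start()
print("Completed")

With this layout it no longer matters which spider finishes first: start() only returns once all crawlers scheduled on the same process are done.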

Scrapy has a group of settings that cannot be applied on a per-spider basis, only per CrawlerProcess.

Settings related to logging, the SpiderLoader, and the Twisted reactor are already initialized before Scrapy reads a spider's custom_settings.
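By contrast, those process-wide values can only be passed once, when the CrawlerProcess itself is created, and are then shared by every spider it runs. A rough sketch, with a placeholder log file name:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# logging is configured when the CrawlerProcess is created,
# so these values apply to every spider in the process
settings.set('LOG_LEVEL', 'INFO')
settings.set('LOG_FILE', 'all_spiders.log')  # placeholder: one log file for the whole process

process = CrawlerProcess(settings)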

When you call scrapy crawl ... from the command-line tool, you are in fact creating a single CrawlerProcess for the single spider given in the command arguments.

Regarding "the process terminates as soon as the second spider completes execution": if you previously ran the same spiders individually with scrapy crawl ..., this behaviour is not expected.



Source: https://stackoverflow.com/questions/62442491/how-to-specify-different-process-settings-for-two-different-spiders-in-crawlerpr
