How to specify different process settings for two different spiders in Scrapy's CrawlerProcess?

Submitted by 房东的猫 on 2021-01-28 16:42:05

Question


I have two spiders that I want to execute in parallel. I used a CrawlerProcess instance and its crawl method to achieve this. However, I want to specify a different output file, i.e. FEED_URI, for each spider in the same process. I tried to loop over the spiders and run them as shown below. Although two different output files are generated, the process terminates as soon as the second spider completes execution. If the first spider finishes crawling before the second one, I get the desired output. However, if the second spider finishes first, the process doesn't wait for the first spider to complete. How can I fix this?

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spider_loader.list():
    setting['FEED_FORMAT'] = 'json'
    setting['LOG_LEVEL'] = 'INFO'
    setting['FEED_URI'] = spider_name+'.json'
    setting['LOG_FILE'] = spider_name+'.log'
    process = CrawlerProcess(setting)
    print("Running spider %s" % spider_name)
    process.crawl(spider_name)

process.start()
print("Completed")

Answer 1:


According to the Scrapy docs, running multiple spiders in a single CrawlerProcess should look like this:

import scrapy
from scrapy.crawler import CrawlerProcess

class Spider1(scrapy.Spider):
    ...

class Spider2(scrapy.Spider):
    ...

process = CrawlerProcess()
process.crawl(Spider1)
process.crawl(Spider2)
process.start()

Settings can be applied on a per-spider basis using the custom_settings spider attribute.
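Applied to the question, a minimal sketch could look like the following (the spider names, start URLs, and output file names are placeholders): each spider declares its own FEED_URI in custom_settings, so a single CrawlerProcess can run both and will not stop until both have finished.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class SpiderOne(scrapy.Spider):
    name = 'spider_one'                    # placeholder name
    start_urls = ['https://example.com']   # placeholder URL

    # per-spider settings: merged into the project settings
    # when the crawler for this spider is created
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': 'spider_one.json',
    }

    def parse(self, response):
        yield {'url': response.url}

class SpiderTwo(scrapy.Spider):
    name = 'spider_two'
    start_urls = ['https://example.org']
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': 'spider_two.json',
    }

    def parse(self, response):
        yield {'url': response.url}

# one CrawlerProcess is created once; both crawls are scheduled on it
# and start() blocks until every scheduled spider has finished
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderOne)
process.crawl(SpiderTwo)
process.start()
print("Completed")

With this layout it no longer matters which spider finishes first: start() only returns once all crawlers scheduled on the same process are done.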

Scrapy has a group of settings that cannot be applied on a per-spider basis, only per CrawlerProcess.

Settings related to logging, the SpiderLoader, and the Twisted reactor are already initialized before Scrapy reads a spider's custom_settings.
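By contrast, those process-wide values can only be passed once, when the CrawlerProcess itself is created, and are then shared by every spider it runs. A rough sketch, with a placeholder log file name:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# logging is configured when the CrawlerProcess is created,
# so these values apply to every spider in the process
settings.set('LOG_LEVEL', 'INFO')
settings.set('LOG_FILE', 'all_spiders.log')  # placeholder: one log file for the whole process

process = CrawlerProcess(settings)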

When you call scrapy crawl ... from the command-line tool, you are in fact creating a single CrawlerProcess for the single spider given in the command arguments.

Regarding "the process terminates as soon as the second spider completes execution": if you previously ran the same spiders individually with scrapy crawl ..., this behaviour is not expected.



Source: https://stackoverflow.com/questions/62442491/how-to-specify-different-process-settings-for-two-different-spiders-in-crawlerpr
