Question
I have two spiders which I want to execute in parallel. I used a CrawlerProcess
instance and its crawl
method to achieve this. However, I want to specify a different output file, i.e. FEED_URI,
for each spider in the same process. I tried to loop over the spiders and run them as shown below. Two different output files are generated, but the process terminates as soon as the second spider completes execution. If the first spider finishes crawling before the second one, I get the desired output. However, if the second spider finishes crawling first, the process does not wait for the first spider to complete. How can I fix this?
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spider_loader.list():
    setting['FEED_FORMAT'] = 'json'
    setting['LOG_LEVEL'] = 'INFO'
    setting['FEED_URI'] = spider_name + '.json'
    setting['LOG_FILE'] = spider_name + '.log'
    process = CrawlerProcess(setting)
    print("Running spider %s" % spider_name)
    process.crawl(spider_name)

process.start()
print("Completed")
Answer 1:
According to the Scrapy docs, running multiple spiders in a single CrawlerProcess
should look like this:
import scrapy
from scrapy.crawler import CrawlerProcess

class Spider1(scrapy.Spider):
    ...

class Spider2(scrapy.Spider):
    ...

process = CrawlerProcess()
process.crawl(Spider1)
process.crawl(Spider2)
process.start()
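Applied to the loop from the question, a minimal sketch of this pattern (leaving the per-spider output files aside for a moment) queues every crawl on one process and calls start() only once:

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

settings = get_project_settings()
settings['FEED_FORMAT'] = 'json'
settings['LOG_LEVEL'] = 'INFO'
process = CrawlerProcess(settings)

# Schedule every spider on the same CrawlerProcess; do not create a new
# process or call start() inside the loop.
for spider_name in process.spider_loader.list():
    print("Running spider %s" % spider_name)
    process.crawl(spider_name)

# start() blocks until all scheduled crawls have finished.
process.start()
print("Completed")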
Setting settings on a per-spider basis can be done using the custom_settings spider attribute.
However, Scrapy has a group of modules whose settings cannot be applied on a per-spider basis (only per
CrawlerProcess
): modules that use logging, SpiderLoader and Twisted reactor related settings are already initialized before Scrapy reads a spider's custom_settings
.
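For the per-spider FEED_URI from the question, a minimal sketch using custom_settings could look like the following (the spider names are hypothetical placeholders). Note that LOG_FILE set this way would not take effect, for the reason above:

import scrapy
from scrapy.crawler import CrawlerProcess

class Spider1(scrapy.Spider):
    name = 'spider1'
    # Feed settings are read per spider from custom_settings.
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': 'spider1.json',
    }
    # start_urls, parse(), etc. go here as usual.

class Spider2(scrapy.Spider):
    name = 'spider2'
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': 'spider2.json',
    }

# Logging is configured per process, so LOG_LEVEL is passed here, not per spider.
process = CrawlerProcess({'LOG_LEVEL': 'INFO'})
process.crawl(Spider1)
process.crawl(Spider2)
process.start()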
When you call scrapy crawl ...
from the command-line tool, you are in fact creating a single CrawlerProcess for the single spider named in the command arguments.
Regarding "the process terminates as soon as the second spider completes execution":
if these are the same spiders you previously launched with scrapy crawl ...,
this behavior is not expected.
Source: https://stackoverflow.com/questions/62442491/how-to-specify-different-process-settings-for-two-different-spiders-in-crawlerpr