Running scrapy from script not including pipeline

前端未结


I'm running scrapy from a script but all it does is activate the spider. It doesn't go through my item pipeline. I've read http://scrapy.readthedocs.org/en/latest/topics/


2 Answers

悲哀的现实

2021-01-04 15:36

@Pawel's and the docs' solution was not working for me and, after looking at Scrapy's source code, I realized that in some cases it was not identifying the settings module correctly. I was wondering why the pipelines were not being used until I realized that they were never found from the script in the first place.

As the docs and Pawel state, I was using:

from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
crawler = Crawler(settings)

but, when calling:

print "these are the pipelines:"
print crawler.settings.__dict__['attributes']['ITEM_PIPELINES']

I got:

these are the pipelines:
<SettingsAttribute value={} priority=0>

The settings object wasn't getting properly populated.

I realized that what is required is a dotted path to the project's settings module, relative to the module containing the script that calls Scrapy, e.g. scrapy.myproject.settings. I then created the Settings() object as follows:

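import os

from scrapy.settings import Settings

settings = Settings()
# Dotted path to the project's settings module (substitute your own project's path)
os.environ['SCRAPY_SETTINGS_MODULE'] = 'scraper.edx_bot.settings'
settings_module_path = os.environ['SCRAPY_SETTINGS_MODULE']
# Load every setting defined in that module at 'project' priority
settings.setmodule(settings_module_path, priority='project')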
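
To illustrate why the dotted path matters, here is a minimal stdlib-only sketch that imitates what I understand `setmodule` to do with a string argument: import the module by its dotted path and collect its uppercase names. It does not use Scrapy itself. The names `load_settings`, `demo_project` and `DemoPipeline` are hypothetical and exist only for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def load_settings(dotted_path):
    """Import a settings module by dotted path and collect its UPPERCASE names."""
    module = importlib.import_module(dotted_path)
    return {name: getattr(module, name) for name in dir(module) if name.isupper()}


# Build a throwaway package on disk (hypothetical names, for illustration only).
root = Path(tempfile.mkdtemp())
package = root / "demo_project"
package.mkdir()
(package / "__init__.py").write_text("")
(package / "settings.py").write_text(
    "ITEM_PIPELINES = {'demo_project.pipelines.DemoPipeline': 300}\n"
)

# The dotted path cannot be resolved while the directory containing the
# package is not on sys.path, so the first attempt fails.
try:
    load_settings("demo_project.settings")
    found_before = True
except ModuleNotFoundError:
    found_before = False

# Once the containing directory is importable, the same dotted path works.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
loaded = load_settings("demo_project.settings")

print(found_before)
print(loaded["ITEM_PIPELINES"])
```

The point of the sketch is that the same string either resolves or fails depending on where the calling script sits relative to the package, which matches the empty ITEM_PIPELINES symptom described above.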

The complete code I used, which effectively imported the pipelines, is:

import os

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.settings import Settings
from scrapy.myproject.spiders.first_spider import FirstSpider

spider = FirstSpider()

settings = Settings()
os.environ['SCRAPY_SETTINGS_MODULE'] = 'scrapy.myproject.settings'
settings_module_path = os.environ['SCRAPY_SETTINGS_MODULE']
settings.setmodule(settings_module_path, priority='project')
crawler = Crawler(settings)

crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start(loglevel=log.INFO)
reactor.run()
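
The code above targets an older Scrapy API. As far as I know, more recent Scrapy releases no longer provide `scrapy.log` or `crawler.configure()`, and `CrawlerProcess` is the usual way to run a spider from a script. The following is a hedged sketch of the same idea (loading settings from an explicit dotted path) under that assumption. `run_spider` is a hypothetical helper, and raising on an empty ITEM_PIPELINES is my own design choice, not something Scrapy does.

```python
def run_spider(spider_cls, settings_module):
    """Run one spider with settings loaded from an explicit dotted module path."""
    # Imported here so that merely defining this helper does not require Scrapy.
    from scrapy.crawler import CrawlerProcess
    from scrapy.settings import Settings

    settings = Settings()
    # Same call as in the answer: load the module's settings at 'project' priority.
    settings.setmodule(settings_module, priority="project")

    # Fail early instead of silently crawling without pipelines (assumed policy).
    if not settings.getdict("ITEM_PIPELINES"):
        raise RuntimeError(
            "ITEM_PIPELINES is empty; check the settings module path: %s"
            % settings_module
        )

    process = CrawlerProcess(settings)
    process.crawl(spider_cls)  # pass the spider class, not an instance
    process.start()  # blocks until the crawl has finished


# Example call (hypothetical project layout):
# run_spider(FirstSpider, "myproject.settings")
```

The early check turns the silent failure from the question into an explicit error, which makes a wrong settings path much easier to spot.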


栀梦

2021-01-04 15:48
You need to actually call get_project_settings(). The Settings object that you are passing to your crawler in your posted code only gives you the defaults, not your project's settings. You need to write something like this:
```
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings

# Load the project's settings instead of the bare defaults
settings = get_project_settings()
crawler = Crawler(settings)
```
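
If get_project_settings() still returns only defaults, it can help to check whether the settings module is locatable at all. As far as I understand, get_project_settings() relies on the SCRAPY_SETTINGS_MODULE environment variable (filled in from scrapy.cfg when it is not already set). The sketch below uses only the standard library, and `describe_settings_module` is a hypothetical helper name.

```python
import importlib.util
import os


def describe_settings_module(env=None):
    """Report whether SCRAPY_SETTINGS_MODULE is set and importable from here."""
    env = os.environ if env is None else env
    dotted_path = env.get("SCRAPY_SETTINGS_MODULE")
    if not dotted_path:
        return "SCRAPY_SETTINGS_MODULE is not set"
    try:
        spec = importlib.util.find_spec(dotted_path)
    except ModuleNotFoundError:
        # A parent package in the dotted path could not be imported.
        spec = None
    if spec is None:
        return "cannot import %s from this location" % dotted_path
    return "%s is importable" % dotted_path


# Demonstration with stand-in module names so the sketch runs anywhere.
print(describe_settings_module({}))
print(describe_settings_module({"SCRAPY_SETTINGS_MODULE": "json.decoder"}))
print(describe_settings_module({"SCRAPY_SETTINGS_MODULE": "no_such_pkg_xyz.settings"}))
```

Calling `describe_settings_module()` with no argument inspects the real environment of the running script.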