I'm running Scrapy from a script, but all it does is activate the spider. It doesn't go through my item pipeline. I've read http://scrapy.readthedocs.org/en/latest/topics/
Neither @Pawel's solution nor the one in the docs worked for me. After looking at Scrapy's source code, I realized that in some cases it was not identifying the settings module correctly. The pipelines were not being used because they were never found from the script in the first place.
As the docs and Pawel state, I was using:
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
crawler = Crawler(settings)
but, when calling:
print("these are the pipelines:")
print(crawler.settings.__dict__['attributes']['ITEM_PIPELINES'])
I got:
these are the pipelines:
<SettingsAttribute value={} priority=0>
In other words, settings wasn't getting properly populated.
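One way to see why the settings might come back empty is to check what a project lookup would find from the script's working directory. As far as I understand it, get_project_settings() relies on the SCRAPY_SETTINGS_MODULE environment variable or, failing that, on a scrapy.cfg found in the current directory or one of its parents. The sketch below is a rough standard-library approximation of that lookup, not Scrapy's own code (Scrapy may also consult other config locations); find_scrapy_cfg and configured_settings_module are hypothetical helper names.

```python
import configparser
import os


def find_scrapy_cfg(start="."):
    """Walk upward from `start` and return the first scrapy.cfg path, or None."""
    path = os.path.abspath(start)
    while True:
        candidate = os.path.join(path, "scrapy.cfg")
        if os.path.isfile(candidate):
            return candidate
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            return None
        path = parent


def configured_settings_module(start="."):
    """Return the settings module a lookup from `start` would likely use, or None."""
    # Assumption: an explicit environment variable wins over scrapy.cfg.
    env_value = os.environ.get("SCRAPY_SETTINGS_MODULE")
    if env_value:
        return env_value
    cfg_path = find_scrapy_cfg(start)
    if cfg_path is None:
        return None
    parser = configparser.ConfigParser()
    parser.read(cfg_path)
    # scrapy.cfg normally holds:  [settings]  default = myproject.settings
    return parser.get("settings", "default", fallback=None)
```

If configured_settings_module() returns None when called from the directory your script runs in, the project settings (and therefore the pipelines) will most likely not be picked up.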
I realized that what is required is a dotted import path to the project's settings module that is importable from the module containing the script that calls Scrapy, e.g. scrapy.myproject.settings. Then I created the Settings() object as follows:
import os
from scrapy.settings import Settings
settings = Settings()
os.environ['SCRAPY_SETTINGS_MODULE'] = 'scrapy.myproject.settings'
settings_module_path = os.environ['SCRAPY_SETTINGS_MODULE']
settings.setmodule(settings_module_path, priority='project')
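Because the value is a dotted import path, it only works if Python can import it from wherever the script runs. A quick check, sketched here with the standard library only (settings_module_importable is a hypothetical helper, not part of Scrapy), can confirm that before the path is handed to setmodule():

```python
import importlib.util


def settings_module_importable(dotted_path):
    """Return True if `dotted_path` resolves to an importable module."""
    try:
        # find_spec imports parent packages first, so a missing parent
        # raises ModuleNotFoundError (a subclass of ImportError).
        return importlib.util.find_spec(dotted_path) is not None
    except (ImportError, ValueError):
        return False
```

If settings_module_importable('scrapy.myproject.settings') returns False, that would suggest the path needs adjusting or that the project root is missing from sys.path.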
The complete code I used, which effectively imported the pipelines, is:
import os

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.settings import Settings
from scrapy.myproject.spiders.first_spider import FirstSpider

spider = FirstSpider()

# Build the settings from an explicit dotted path instead of get_project_settings()
settings = Settings()
os.environ['SCRAPY_SETTINGS_MODULE'] = 'scrapy.myproject.settings'
settings_module_path = os.environ['SCRAPY_SETTINGS_MODULE']
settings.setmodule(settings_module_path, priority='project')

crawler = Crawler(settings)
# Stop the Twisted reactor once the spider has finished
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start(loglevel=log.INFO)
reactor.run()
You need to actually call get_project_settings. The Settings object that you are passing to your crawler in your posted code will give you the defaults, not your specific project settings. You need to write something like this:
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
crawler = Crawler(settings)
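Whichever way the settings are built, it may be worth verifying that they actually carry your pipelines before starting the crawl. A minimal sketch, assuming only that the settings object exposes a dict-style get() (Scrapy's Settings does, and a plain dict works for illustration); require_pipelines is a hypothetical name:

```python
def require_pipelines(settings):
    """Return the configured pipelines, or raise if none were loaded."""
    pipelines = settings.get("ITEM_PIPELINES")
    if not pipelines:
        # An empty value usually means the defaults were loaded instead of
        # the project's own settings module.
        raise RuntimeError(
            "ITEM_PIPELINES is empty; the project settings module "
            "was probably not found"
        )
    return pipelines
```

Calling require_pipelines(settings) right before Crawler(settings) turns a silently skipped pipeline into an immediate error.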