Running Scrapy from a script not including pipeline

醉梦人生 · 2021-01-04 15:00

I'm running Scrapy from a script but all it does is activate the spider. It doesn't go through my item pipeline. I've read http://scrapy.readthedocs.org/en/latest/topics/

2 Answers
  • 2021-01-04 15:36

Neither @Pawel's answer nor the docs' solution was working for me, and after looking at Scrapy's source code I realized that, in some cases, it was not identifying the settings module correctly. I was wondering why the pipelines were not being used, until I realized that they were never found from the script in the first place.

    As the docs and Pawel state, I was using:

    from scrapy.crawler import Crawler
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    crawler = Crawler(settings)
    

    but, when calling:

    print "these are the pipelines:"
    print crawler.settings.__dict__['attributes']['ITEM_PIPELINES']
    

    I got:

    these are the pipelines:
    <SettingsAttribute value={} priority=0>
    

    settings wasn't getting properly populated.

    I realized that what is required is the dotted import path to the project's settings module, importable from the script that calls Scrapy, e.g. scrapy.myproject.settings. Then, I created the Settings() object as follows:

    import os
    from scrapy.settings import Settings

    settings = Settings()
    os.environ['SCRAPY_SETTINGS_MODULE'] = 'scraper.edx_bot.settings'
    settings_module_path = os.environ['SCRAPY_SETTINGS_MODULE']
    settings.setmodule(settings_module_path, priority='project')
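    To see why a wrong module path leaves the settings empty, here is a minimal sketch of the mechanism (my own simplification, not Scrapy's actual code): `setmodule` essentially imports the dotted path and copies the module's UPPERCASE attributes into the settings, so if the path can't be imported, nothing is copied. The `myproject_settings` module below is a fake one registered just to keep the example self-contained.

    ```python
    import importlib
    import sys
    import types

    def load_settings(module_path):
        # Roughly what Settings.setmodule does: import the dotted path
        # and keep only the UPPERCASE names as settings.
        module = importlib.import_module(module_path)
        return {name: getattr(module, name)
                for name in dir(module) if name.isupper()}

    # Register a fake settings module so the example runs on its own.
    fake = types.ModuleType("myproject_settings")
    fake.ITEM_PIPELINES = {"myproject.pipelines.MyPipeline": 300}
    fake.BOT_NAME = "myproject"
    sys.modules["myproject_settings"] = fake

    print(load_settings("myproject_settings"))
    # A wrong dotted path fails with ImportError rather than silently:
    try:
        load_settings("no.such.settings")
    except ImportError as exc:
        print("could not import settings:", exc)
    ```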
    

    The complete code I used, which effectively imported the pipelines, is:

    import os

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy import log, signals
    from scrapy.settings import Settings
    from scrapy.myproject.spiders.first_spider import FirstSpider

    spider = FirstSpider()

    settings = Settings()
    os.environ['SCRAPY_SETTINGS_MODULE'] = 'scrapy.myproject.settings'
    settings_module_path = os.environ['SCRAPY_SETTINGS_MODULE']
    settings.setmodule(settings_module_path, priority='project')
    crawler = Crawler(settings)
    
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start(loglevel=log.INFO)
    reactor.run()
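    Note that in modern Scrapy (1.x and later) the `scrapy.log` module and `Crawler.configure()` no longer exist; the documented way to run a spider from a script is `CrawlerProcess`, which manages the Twisted reactor and logging itself and applies the project settings (including `ITEM_PIPELINES`). A sketch, assuming the script can import the project (the spider path here mirrors the example above) and that `get_project_settings` can locate the settings via `scrapy.cfg` or `SCRAPY_SETTINGS_MODULE`:

    ```python
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from scrapy.myproject.spiders.first_spider import FirstSpider

    # get_project_settings() picks up the project settings, so the
    # configured item pipelines are wired in automatically.
    process = CrawlerProcess(get_project_settings())
    process.crawl(FirstSpider)
    process.start()  # blocks until the crawl finishes
    ```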
    
  • 2021-01-04 15:48

    You need to actually call get_project_settings; the Settings object that you are passing to your crawler in your posted code will give you the defaults, not your specific project settings. You need to write something like this:

    from scrapy.utils.project import get_project_settings
    settings = get_project_settings()
    crawler = Crawler(settings)
    