Scrapy crawl from script always blocks script execution after scraping

一整个雨季 2020-11-30 06:47

I am following this guide http://doc.scrapy.org/en/0.16/topics/practices.html#run-scrapy-from-a-script to run Scrapy from my script. My script is based on the example there:
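
A sketch of that 0.16 doc example (FollowAllSpider is the spider from the docs' testspiders demo project; this is the docs' code, not the asker's exact script):

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy import log
    from testspiders.spiders.followall import FollowAllSpider

    spider = FollowAllSpider(domain='scrapinghub.com')
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # blocks here: nothing ever calls reactor.stop(), so the script never continues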



        
2 Answers
  • In Scrapy 0.19.x you should do this:

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy import log, signals
    from testspiders.spiders.followall import FollowAllSpider
    from scrapy.utils.project import get_project_settings
    
    spider = FollowAllSpider(domain='scrapinghub.com')
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # the script will block here until the spider_closed signal is sent
    

    Note these lines

    settings = get_project_settings()
    crawler = Crawler(settings)
    

    Without them, your spider won't use your project settings and will not save your items (see the sketch below). It took me a while to figure out why the example in the documentation wasn't saving my items. I sent a pull request to fix the doc example.
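
    A minimal way to see the difference (this assumes your project defines ITEM_PIPELINES in settings.py; the comments describe the typical outcome):

    from scrapy.settings import Settings
    from scrapy.utils.project import get_project_settings

    # Settings() only carries Scrapy's built-in defaults...
    print(Settings().get('ITEM_PIPELINES'))              # empty, so scraped items go nowhere
    # ...while get_project_settings() loads your project's settings.py
    # (located via scrapy.cfg / the SCRAPY_SETTINGS_MODULE environment variable)
    print(get_project_settings().get('ITEM_PIPELINES'))  # your project's pipelines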

    One more way to do it is to call the command directly from your script:

    from scrapy import cmdline
    cmdline.execute("scrapy crawl followall".split())  # followall is the spider's name
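
    One caveat: cmdline.execute() drives the scrapy command-line tool, and in the Scrapy versions of this era it ends the whole process with sys.exit() once the command finishes, so control never returns to the rest of your script.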
    
  • 2020-11-30 07:14

    You will need to stop the reactor when the spider finishes. You can accomplish this by listening for the spider_closed signal:

    from twisted.internet import reactor
    
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy.xlib.pydispatch import dispatcher
    
    from testspiders.spiders.followall import FollowAllSpider
    
    def stop_reactor():
        reactor.stop()
    
    dispatcher.connect(stop_reactor, signal=signals.spider_closed)
    spider = FollowAllSpider(domain='scrapinghub.com')
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    log.msg('Running reactor...')
    reactor.run()  # the script will block here until the spider is closed
    log.msg('Reactor stopped.')
    

    And the command line log output might look something like:

    stav@maia:/srv/scrapy/testspiders$ ./api
    2013-02-10 14:49:38-0600 [scrapy] INFO: Running reactor...
    2013-02-10 14:49:47-0600 [followall] INFO: Closing spider (finished)
    2013-02-10 14:49:47-0600 [followall] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 23934,...}
    2013-02-10 14:49:47-0600 [followall] INFO: Spider closed (finished)
    2013-02-10 14:49:47-0600 [scrapy] INFO: Reactor stopped.
    stav@maia:/srv/scrapy/testspiders$
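
    On newer Scrapy releases (1.0 and later) you don't have to wire up the reactor yourself: CrawlerProcess starts and stops it for you. A sketch of the equivalent script, assuming the same testspiders project:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from testspiders.spiders.followall import FollowAllSpider

    process = CrawlerProcess(get_project_settings())  # picks up your project settings
    process.crawl(FollowAllSpider, domain='scrapinghub.com')
    process.start()  # blocks while the crawl runs, then returns once the spider closes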
    