Scrapy on a schedule

北海茫月 2020-12-09 22:35

Getting Scrapy to run on a schedule is driving me around the Twist(ed).

I thought the below test code would work, but I get a twisted.internet.error.ReactorNot

2 answers
  • 2020-12-09 22:39

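    You can use APScheduler. Its TwistedScheduler runs jobs inside the same Twisted reactor that Scrapy uses, so the reactor only has to be started once.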

    pip install apscheduler
    
    # -*- coding: utf-8 -*-
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    from apscheduler.schedulers.twisted import TwistedScheduler
    
    from Demo.spiders.baidu import YourSpider
    
    process = CrawlerProcess(get_project_settings())
    scheduler = TwistedScheduler()
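    # Start a new crawl of YourSpider every 10 seconds.
    scheduler.add_job(process.crawl, 'interval', args=[YourSpider], seconds=10)
    scheduler.start()
    # Keep the reactor running after each crawl so later jobs can still fire.
    process.start(stop_after_crawl=False)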
    
  • 2020-12-09 23:03

    Two things are worth noting first. There is normally only one Twisted reactor per process, and it cannot be restarted once it has stopped (as you've discovered). Blocking calls such as time.sleep(n) should also be avoided while the reactor is running; use asynchronous alternatives such as twisted.internet.task.deferLater(reactor, n, ...) instead.
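
To make the contrast with time.sleep concrete, here is a minimal sketch. `schedule_later` and `main` are hypothetical names, not part of Twisted; the helper relies only on `callLater(delay, callable, *args)`, the reactor method that `task.deferLater` is built on. The `clock` argument is assumed to be the reactor, or a stand-in such as `twisted.internet.task.Clock` in tests.

```python
def schedule_later(clock, seconds, func, *args):
    """Ask the event loop to call func(*args) after `seconds`, then return at once.

    Unlike time.sleep(seconds), this does not block: the reactor keeps
    servicing network events until the delay has elapsed.

    clock: any object with Twisted's callLater(delay, callable, *args)
           method (assumption: normally twisted.internet.reactor).
    """
    return clock.callLater(seconds, func, *args)


def main():
    """Try the helper with the real reactor (requires Twisted to be installed)."""
    from twisted.internet import reactor

    schedule_later(reactor, 2, print, "2 seconds passed without blocking the reactor")
    schedule_later(reactor, 3, reactor.stop)  # stop the reactor so the script exits
    reactor.run()
```

Calling `main()` should print the message after about two seconds and exit a second later. Passing the clock in as a parameter is a design choice that keeps the helper easy to test without a running reactor.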

    To use Scrapy from within a Twisted program, use the scrapy.crawler.CrawlerRunner core API rather than scrapy.crawler.CrawlerProcess. The main difference between the two is that CrawlerProcess starts and stops Twisted's reactor for you (which is what makes running it a second time difficult), whereas CrawlerRunner leaves starting the reactor to the developer. Here's what your code could look like with CrawlerRunner:

    from twisted.internet import reactor
    from quotesbot.spiders.quotes import QuotesSpider
    from scrapy.crawler import CrawlerRunner
    
    def run_crawl():
        """
        Run a spider within Twisted. Once it completes,
        wait 5 seconds and run another spider.
        """
        runner = CrawlerRunner({
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
            })
        deferred = runner.crawl(QuotesSpider)
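        # You can use reactor.callLater or task.deferLater to schedule a function.
        # addCallback passes the crawl result as the first argument, so wrap
        # callLater in a lambda that ignores it.
        deferred.addCallback(lambda _: reactor.callLater(5, run_crawl))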
        return deferred
    
    run_crawl()
    reactor.run()   # you have to run the reactor yourself
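
The crawl, wait, repeat loop can also be factored into a small reusable helper. This is only a sketch: `crawl_forever`, `start_crawl`, and `clock` are hypothetical names. It assumes `clock` offers `callLater(delay, callable)` as the Twisted reactor does, and that `start_crawl` returns a Deferred, as `runner.crawl(QuotesSpider)` does.

```python
def crawl_forever(clock, interval, start_crawl):
    """Run start_crawl(); each time it finishes, wait `interval` seconds and rerun.

    clock:       object with callLater(delay, callable), e.g. twisted.internet.reactor
    start_crawl: zero-argument callable returning a Deferred, e.g.
                 lambda: runner.crawl(QuotesSpider)
    Returns the Deferred of the first crawl.
    """
    def run_once():
        deferred = start_crawl()

        def schedule_next(_result):
            # Deferred callbacks receive the previous result as their first
            # argument; it is ignored here and the next run is queued.
            clock.callLater(interval, run_once)

        deferred.addCallback(schedule_next)
        return deferred

    return run_once()
```

With a `CrawlerRunner` instance named `runner`, usage would look like `crawl_forever(reactor, 5, lambda: runner.crawl(QuotesSpider))` followed by `reactor.run()`. Because only `addCallback` is used, the loop stops if a crawl fails; `addBoth` would keep it going after errors as well.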
    