CrawlerProcess vs CrawlerRunner


The Scrapy 1.x documentation explains that there are two ways to run a Scrapy spider from a script:

  • using CrawlerProcess
  • using CrawlerRunner
2 Answers
  • 2020-12-29 04:09

    Scrapy's documentation does a pretty bad job of giving real-world examples of either class.

    CrawlerProcess assumes that Scrapy is the only thing that is going to use Twisted's reactor. If you are using threads in Python to run other code, this isn't always true. Let's take this as an example.

    from scrapy.crawler import CrawlerProcess
    import scrapy

    def notThreadSafe(x):
        """do something that isn't thread-safe"""
        # ...

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start()  # the script blocks here until all crawling jobs are finished
    notThreadSafe(3)  # only gets executed after the crawlers stop
    

    Now, as you can see, the function only gets executed after the crawlers stop. What if I want the function to run while the crawlers are crawling, in the same reactor?

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    import scrapy

    def notThreadSafe(x):
        """do something that isn't thread-safe"""
        # ...

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())  # stop the reactor once both crawls finish
    reactor.callFromThread(notThreadSafe, 3)  # schedule the call to run in the reactor
    reactor.run()  # runs both crawlers and the scheduled function
    

    The Runner class is not limited to this functionality; you may want custom behavior around your reactor (deferreds, threads, getPage, custom error reporting, etc.).
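
    For instance, here is a minimal sketch (assuming the same MySpider1/MySpider2 definitions as above) that uses a deferred to chain the crawls sequentially instead of running them in parallel; note that CrawlerRunner, unlike CrawlerProcess, does not configure logging for you:

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    configure_logging()  # CrawlerRunner does not set up logging on its own
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        # each yield waits for the previous crawl's deferred to fire
        yield runner.crawl(MySpider1)
        yield runner.crawl(MySpider2)
        reactor.stop()

    crawl()
    reactor.run()  # the script blocks here until the last crawl is finished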

  • 2020-12-29 04:21

    CrawlerRunner:

    This class shouldn’t be needed (since Scrapy is responsible of using it accordingly) unless writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.

    CrawlerProcess:

    This utility should be a better fit than CrawlerRunner if you aren’t running another Twisted reactor within your application.

    It sounds like CrawlerProcess is what you want, unless you're adding your crawlers to an existing Twisted application.
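
    For completeness, a minimal run-from-a-script sketch with CrawlerProcess; the spider name and URL here are placeholders standing in for your own spider:

    from scrapy.crawler import CrawlerProcess
    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"  # hypothetical spider, replace with your own
        start_urls = ["https://example.com"]

        def parse(self, response):
            yield {"title": response.css("title::text").get()}

    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(MySpider)
    process.start()  # starts the Twisted reactor and blocks until crawling finishes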
