Scrapy: non-blocking pause

南方客 2021-01-31 10:31

I have a problem: I need to pause the execution of one function for a while without stopping the parsing process as a whole. In other words, I need a non-blocking pause.

It

3 Answers
  • 2021-01-31 11:06

    If you're attempting to use this for rate limiting, you probably just want to use DOWNLOAD_DELAY instead.
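
    For the rate-limiting case, a minimal settings sketch might look like the following. DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY are real Scrapy settings; the values are illustrative assumptions, not recommendations.

```python
# settings.py (sketch): throttle requests instead of sleeping in callbacks.
# The values below are illustrative assumptions, not recommendations.

# Seconds Scrapy waits between consecutive requests to the same website.
DOWNLOAD_DELAY = 10.0

# When True (Scrapy's default), the actual wait is randomized between
# 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = False
```

    The same keys can also go into a spider's custom_settings dict if the delay should apply to one spider only.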

    Scrapy is just a framework on top of Twisted. For the most part, you can treat it like any other Twisted app. Instead of calling sleep, return a Deferred and have Twisted fire it with the next request after a delay. For example:

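    from scrapy import Request
    from twisted.internet import reactor, defer
    
    def non_stop_function(self, response):
        # Spider callback: return a Deferred instead of calling time.sleep().
        d = defer.Deferred()
        # After 10 seconds the reactor fires d with the next Request.
        reactor.callLater(10.0, d.callback, Request(
            'some url',
            callback=self.non_stop_function
        ))
        return d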
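
    To watch the pattern in isolation without Scrapy or Twisted installed, here is an analogous sketch using only the standard library's asyncio. It is an analogy, not Scrapy code: loop.call_later stands in for reactor.callLater, and a Future stands in for the Deferred. The names paused_task, other_work and events are made up for the illustration.

```python
import asyncio

events = []  # records the order in which things happen


async def paused_task(delay):
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    # Analogue of reactor.callLater(delay, d.callback, value): the loop
    # completes the future later; nothing blocks in the meantime.
    loop.call_later(delay, fut.set_result, "next request")
    events.append("pause started")
    value = await fut
    events.append(f"pause finished: {value}")


async def other_work():
    # Runs while paused_task is waiting, showing the pause is non-blocking.
    await asyncio.sleep(0.01)
    events.append("other work done")


async def main():
    await asyncio.gather(paused_task(0.05), other_work())


asyncio.run(main())
print(events)
```

    The other work completes while the first task is still paused, which is the behaviour the question asks for.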
    
  • 2021-01-31 11:07

    The Request object takes a callback parameter; try using it for this purpose. That is, create a Deferred that wraps self.second_parse_function followed by a pause.

    Here is a quick, untested example; changed lines are marked.

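    from scrapy import Request, Spider
    from twisted.internet import reactor
    from twisted.internet.defer import succeed
    from twisted.internet.task import deferLater
    
    def pause(result, seconds):
        # Fires with the unchanged result after `seconds`, without blocking the reactor.
        return deferLater(reactor, seconds, lambda: result)
    
    class ScrapySpider(Spider):
        name = 'live_function'
    
        def start_requests(self):
            yield Request('some url', callback=self.non_stop_function)
    
        def non_stop_function(self, response):
            for url in ['url1', 'url2', 'url3', 'more urls']:
                yield Request(url, callback=self.parse_and_pause)  # changed
    
            yield Request('some url', callback=self.non_stop_function)  # Call itself
    
        def parse_and_pause(self, response):  # changed
            # A Deferred fires only once and a Request callback must be callable,
            # so build a fresh chain for every response.
            d = succeed(response)  # changed
            d.addCallback(self.second_parse_function)  # changed
            d.addCallback(pause, seconds=10)  # changed
            return d
    
        def second_parse_function(self, response):
            pass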
    

    If the approach works for you, you can write a function that builds such a Deferred chain for any callback. It could be implemented like this:

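    from twisted.internet import reactor, defer, task
    
    def get_perform_and_pause_deferred(seconds, fn, *args, **kwargs):
        def callback(response):  # Request needs a callable, so wrap the chain
            d = defer.succeed(response)  # fresh Deferred per response
            d.addCallback(fn, *args, **kwargs)
            d.addCallback(lambda r: task.deferLater(reactor, seconds, lambda: r))
            return d
        return callback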
    

    And here is possible usage:

    class ScrapySpider(Spider):
        name = 'live_function'
    
        def start_requests(self):
            yield Request('some url', callback=self.non_stop_function)
    
        def non_stop_function(self, response):
            for url in ['url1', 'url2', 'url3', 'more urls']:
                # changed
                yield Request(url, callback=get_perform_and_pause_deferred(10, self.second_parse_function))
    
            yield Request('some url', callback=self.non_stop_function)  # Call itself
    
        def second_parse_function(self, response):
            pass
    
  • 2021-01-31 11:09

    The asker already gave an answer in the question's update, but here is a slightly improved version that can be reused for any request.

    # removed...
    import scrapy
    from twisted.internet import reactor, defer
    
    class MySpider(scrapy.Spider):
        # removed...
    
        def request_with_pause(self, response):
            # Re-request the same URL after response.meta['time'] seconds,
            # handing the delayed response to response.meta['callback'].
            d = defer.Deferred()
            reactor.callLater(response.meta['time'], d.callback, scrapy.Request(
                response.url,
                callback=response.meta['callback'],
                dont_filter=True,  # the URL was already fetched once
                meta={'dont_proxy': response.meta['dont_proxy']}))
            return d
    
        def parse(self, response):
            # removed....
            yield scrapy.Request(the_url, meta={
                'time': 86400,  # pause length in seconds (one day)
                'callback': self.the_parse,
                'dont_proxy': True,
            }, callback=self.request_with_pause)
    

    By way of explanation: Scrapy uses Twisted to manage requests asynchronously, so a delayed request also has to go through Twisted's scheduling tools.
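
    To illustrate why a plain time.sleep is not an option in a single-threaded event loop, here is a small standard-library sketch. It uses asyncio as a stand-in for the Twisted reactor, so it is an analogy rather than Scrapy code; blocking_pause, other_request and order are hypothetical names.

```python
import asyncio
import time

order = []  # records what the single-threaded event loop managed to run


async def blocking_pause():
    order.append("pause start")
    time.sleep(0.05)  # blocks the whole loop: nothing else can run
    order.append("pause end")


async def other_request():
    order.append("other request handled")


async def main():
    await asyncio.gather(blocking_pause(), other_request())


asyncio.run(main())
print(order)
```

    The other request is handled only after the blocking pause has ended, which is the stall that scheduling the delay through the event loop avoids.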
