Scrapy - set delay to retry middleware

自闭症患者 2021-01-15 00:23

I'm using Scrapy-splash and I have a problem with memory. I can clearly see that memory used by docker python3 is gradually increasing...

2 answers
  • 2021-01-15 00:34

    Method 1

    One way would be to add a downloader middleware to your project (source, linked):

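    # File: middlewares.py
    
    from twisted.internet.defer import Deferred
    
    
    class DelayedRequestsMiddleware:
        """Delay any request that carries a 'delay_request_by' meta key (seconds)."""
    
        def process_request(self, request, spider):
            delay_s = request.meta.get('delay_request_by', None)
            if not delay_s:
                return  # No delay requested: continue processing normally
    
            # Imported here rather than at module level so that importing
            # this file does not install a reactor as a side effect
            from twisted.internet import reactor
    
            # Fire the deferred after delay_s seconds without blocking the reactor
            deferred = Deferred()
            reactor.callLater(delay_s, deferred.callback, None)
            return deferred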
    

    You could then use it in your spider like this:

    import scrapy
    
    
    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        custom_settings = {
            'DOWNLOADER_MIDDLEWARES': {'middlewares.DelayedRequestsMiddleware': 123},
        }
    
        def start_requests(self):
            # This request will have itself delayed by 5 seconds
            yield scrapy.Request(url='http://quotes.toscrape.com/page/1/', 
                                 meta={'delay_request_by': 5})
            # This request will not be delayed
            yield scrapy.Request(url='http://quotes.toscrape.com/page/2/')
    
        def parse(self, response):
            ...  # Process results here
    

    Method 2

    You could do this with a custom retry middleware (source). You just need to override the process_response method of the built-in RetryMiddleware:

    from scrapy.downloadermiddlewares.retry import RetryMiddleware
    from scrapy.utils.response import response_status_message
    
    
    class CustomRetryMiddleware(RetryMiddleware):
    
        def process_response(self, request, response, spider):
            if request.meta.get('dont_retry', False):
                return response
            if response.status in self.retry_http_codes:
                reason = response_status_message(response.status)
    
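                # Your delay code here, for example time.sleep(10) or polling the
                # server until it is alive. Note that a blocking call such as
                # time.sleep pauses the whole crawler, not just this request.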
    
                return self._retry(request, reason, spider) or response
    
            return response
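
    For the "polling server until it is alive" option mentioned in the comment above, a minimal sketch using only the standard library might look like the following. The name wait_until_alive and its parameters are hypothetical, not part of Scrapy, and the call blocks, so the whole crawler pauses while it waits:

```python
import socket
import time


def wait_until_alive(host, port, timeout_s=30.0, interval_s=1.0):
    """Poll a TCP port until it accepts a connection or timeout_s elapses.

    Returns True if the server answered in time, False otherwise.
    (Hypothetical helper for illustration; not a Scrapy API.)
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening again
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            pass  # Refused or timed out: the server is still down
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

    Inside process_response this could be called just before self._retry(...), for example wait_until_alive('localhost', 8050) if the unreachable service were a local Splash instance on its default port (an assumed setup).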
    

    Then enable CustomRetryMiddleware in place of the default RetryMiddleware in settings.py:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
        'myproject.middlewarefilepath.CustomRetryMiddleware': 550,
    }
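
    The two methods can also be combined so that only retried requests are delayed: CustomRetryMiddleware could set delay_request_by on the request returned by self._retry, and DelayedRequestsMiddleware would then hold that request back. A rough sketch, assuming both middlewares are enabled; backoff_delay and its default values are made up for illustration:

```python
def backoff_delay(retry_times, base_s=2.0, cap_s=60.0):
    """Seconds to wait before retry number retry_times (1-based).

    Doubles the wait on every retry, capped at cap_s.
    (Hypothetical helper for illustration; not a Scrapy API.)
    """
    if retry_times < 1:
        return 0.0
    return min(cap_s, base_s * 2 ** (retry_times - 1))


# Possible use inside CustomRetryMiddleware.process_response (Method 2),
# replacing the final "return self._retry(...) or response" line:
#
#     retry_request = self._retry(request, reason, spider)
#     if retry_request is None:
#         return response  # Retries exhausted: hand the response back
#     # RetryMiddleware stores the attempt count in meta['retry_times']
#     retry_request.meta['delay_request_by'] = backoff_delay(
#         retry_request.meta.get('retry_times', 1))
#     return retry_request
```

    With these defaults the first retry would wait 2 seconds, the second 4, the third 8, and so on up to 60. Because the wait goes through the Deferred from Method 1, other requests keep running in the meantime.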
    
  • 2021-01-15 00:35
    1. A more elaborate solution could be to set up a Kubernetes cluster in which you have multiple replicas running. This way you avoid having a failure of just 1 container impacting your scraping job.

    2. I don't think it's easy to configure a waiting time only for retries. You could play with DOWNLOAD_DELAY (though this affects the delay between all requests, not just retries), or set RETRY_TIMES to a higher value than the default of 2.
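
    As a sketch of that second suggestion, the two settings could be set in settings.py along these lines. The values are only illustrative and would need tuning for the site being scraped:

```python
# settings.py (illustrative values, not recommendations)

# Minimum wait, in seconds, between consecutive requests to the same site.
# Applies to every request, not only retries. Scrapy's default is 0.
DOWNLOAD_DELAY = 2

# Maximum number of retries per request, in addition to the first attempt.
# Scrapy's default is 2.
RETRY_TIMES = 5
```

    The same keys could instead go in a spider's custom_settings dict if only one spider should be affected.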
