How to handle a 429 Too Many Requests response in Scrapy?

深忆病人 2020-12-28 22:54

I'm trying to run a scraper whose output log ends as follows:

2017-04-25 20:22:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <42         


        
3 Answers
  • 2020-12-28 23:10

    Wow, your scraper is going really fast: over 30,000 requests in 30 minutes works out to roughly 17 requests per second.

    Such a high volume will trigger rate limiting on bigger sites and will completely bring down smaller sites. Don't do that.

    Also, this might be too fast even for Privoxy and Tor, so either of them could also be the source of the 429 responses.

    Solutions:

    1. Start slow. Reduce the concurrency settings and increase DOWNLOAD_DELAY so that you make at most 1 request per second. Then relax these values step by step and see what happens. It might sound paradoxical, but you may get more items and more 200 responses by going slower.

    2. If you are scraping a big site, try rotating proxies. The Tor network might be a bit heavy-handed for this in my experience, so you might try a proxy service like Umair is suggesting.
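
    As a rough illustration of the "start slow" advice, a settings.py fragment might look like the one below. The setting names are standard Scrapy settings, but the values are only assumed starting points and should be tuned for the site being scraped.

```python
# settings.py: conservative starting values (assumed; tune per site)
CONCURRENT_REQUESTS = 1              # one request in flight at a time
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1.0                 # seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True      # vary the wait between 0.5x and 1.5x DOWNLOAD_DELAY

# Optional: let the AutoThrottle extension adapt the delay to observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

    With AutoThrottle enabled, DOWNLOAD_DELAY acts as the minimum delay, so the crawl should not go faster than about one request per second until these values are relaxed.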

  • 2020-12-28 23:10

    You can modify the retry middleware to pause when it gets a 429 error. Put the code below in middlewares.py:

    import time

    from scrapy.downloadermiddlewares.retry import RetryMiddleware
    from scrapy.utils.response import response_status_message


    class TooManyRequestsRetryMiddleware(RetryMiddleware):
        """Retry middleware that pauses the crawl when it receives a 429."""

        def __init__(self, crawler):
            super().__init__(crawler.settings)
            self.crawler = crawler

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def process_response(self, request, response, spider):
            if request.meta.get('dont_retry', False):
                return response
            if response.status == 429:
                # Pause the engine, wait out the rate limit, then resume.
                # Note: time.sleep() blocks the whole Scrapy process, not only this request.
                self.crawler.engine.pause()
                time.sleep(60)  # Use the site's rate-limit window: 60 if it resets every minute, and so on.
                self.crawler.engine.unpause()
                reason = response_status_message(response.status)
                return self._retry(request, reason, spider) or response
            if response.status in self.retry_http_codes:
                reason = response_status_message(response.status)
                return self._retry(request, reason, spider) or response
            return response
    

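    Add 429 to the retry codes in settings.py. This setting replaces Scrapy's default list, so include any other status codes you still want retried: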

    RETRY_HTTP_CODES = [429]
    

    Then activate the middleware in settings.py, and don't forget to deactivate the default retry middleware. Here `flat` is the project name; replace it with your own.

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
        'flat.middlewares.TooManyRequestsRetryMiddleware': 543,
    }
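
    The fixed 60-second pause above is a guess at the rate-limit window. A server sending a 429 may include a Retry-After header, given either as a number of seconds or as an HTTP date. The helper below is a minimal sketch of reading that header; `retry_after_seconds` and `DEFAULT_PAUSE` are hypothetical names that are not part of Scrapy, and it assumes the site actually sends the header.

```python
import time
from datetime import timezone
from email.utils import parsedate_to_datetime

DEFAULT_PAUSE = 60.0  # assumed fallback, matching the fixed sleep in the middleware


def retry_after_seconds(header_value, default=DEFAULT_PAUSE, now=None):
    """Return how many seconds to wait for a given Retry-After header value.

    header_value may be str, bytes, or None (header absent).
    Falls back to `default` when the header is missing or unparseable.
    """
    if header_value is None:
        return default
    if isinstance(header_value, bytes):
        header_value = header_value.decode("latin-1")
    value = header_value.strip()

    # Form 1: delay in whole seconds, e.g. "120".
    if value.isdigit():
        return float(value)

    # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if target.tzinfo is None:
        # Treat a date without zone information as UTC.
        target = target.replace(tzinfo=timezone.utc)

    now = time.time() if now is None else now
    return max(0.0, target.timestamp() - now)
```

    Inside process_response, the fixed sleep could then become `time.sleep(retry_after_seconds(response.headers.get('Retry-After')))`, since Scrapy returns header values as bytes or None. The blocking behaviour of time.sleep() is unchanged by this.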
    
  • 2020-12-28 23:18

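    You can use HTTPERROR_ALLOWED_CODES = [404, 429]. I was getting the 429 HTTP code, and once I allowed it the problem was fixed. You can allow whichever HTTP code you see in the terminal. Note that this only passes those responses to your spider callbacks instead of dropping them; it does not lift the rate limit itself. Still, this may solve your problem.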
