I\'m trying to run a scraper of which the output log ends as follows:
2017-04-25 20:22:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <42
-
Wow, your scraper is going really fast, over 30,000 requests in 30 minutes. That's more than 10 requests per second.
Such a high volume will trigger rate limiting on bigger sites and will completely bring down smaller sites. Don't do that.
Also this might even be too fast for privoxy and tor, so these might also be candidates for those replies with a 429.
Solutions:
Start slow. Reduce the concurrency settings and increase DOWNLOAD_DELAY so you do at max 1 request per second. Then increase these values step by step and see what happens. It might sound paradox, but you might be able to get more items and more 200 response by going slower.
If you are scraping a big site try rotating proxies. The tor network might be a bit heavy handed for this in my experience, so you might try a proxy service like Umair is suggesting
讨论(0)
-
You can modify the retry middleware to pause when it gets error 429. Put this code below in middlewares.py
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message
import time
class TooManyRequestsRetryMiddleware(RetryMiddleware):
def __init__(self, crawler):
super(TooManyRequestsRetryMiddleware, self).__init__(crawler.settings)
self.crawler = crawler
@classmethod
def from_crawler(cls, crawler):
return cls(crawler)
def process_response(self, request, response, spider):
if request.meta.get('dont_retry', False):
return response
elif response.status == 429:
self.crawler.engine.pause()
time.sleep(60) # If the rate limit is renewed in a minute, put 60 seconds, and so on.
self.crawler.engine.unpause()
reason = response_status_message(response.status)
return self._retry(request, reason, spider) or response
elif response.status in self.retry_http_codes:
reason = response_status_message(response.status)
return self._retry(request, reason, spider) or response
return response
Add 429 to retry codes in settings.py
RETRY_HTTP_CODES = [429]
Then activate it on settings.py
. Don't forget to deactivate the default retry middleware.
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
'flat.middlewares.TooManyRequestsRetryMiddleware': 543,
}
讨论(0)
-
You can use HTTPERROR_ALLOWED_CODES =[404,429]
. I was getting 429 HTTP code and I just allowed it and then problem fixed. You can allow the HTTP code that you are getting in terminal. This may be solve your problem.
讨论(0)
- 热议问题