During my crawl, some pages fail with an unexpected redirection and no response is returned. How can I catch this kind of error and re-schedule a request with the original URL, not the redirected one?
You could pass a lambda as an errback:
request = Request(url, dont_filter=True, callback=self.parse,
                  errback=lambda failure: self.download_errback(failure, url))
That way you'll have access to the original url inside the errback function:
def download_errback(self, failure, url):
    # failure is a twisted Failure describing why the download failed
    print(url)
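If you also want to re-schedule the failed URL (as the question asks), here is a minimal sketch along those lines, assuming a Scrapy version where requests returned from an errback are scheduled like callback output; RetrySpider, MAX_ERRBACK_RETRIES, the example.com URL, and the errback_retries meta key are made-up names for illustration:

import scrapy
from scrapy import Request

class RetrySpider(scrapy.Spider):
    name = 'retry_example'                      # hypothetical spider name
    start_urls = ['http://example.com/page']    # hypothetical URL
    MAX_ERRBACK_RETRIES = 3                     # hypothetical cap, not a Scrapy setting

    def start_requests(self):
        for url in self.start_urls:
            # bind url as a default argument so each lambda keeps its own copy
            yield Request(url, dont_filter=True, callback=self.parse,
                          errback=lambda failure, url=url: self.download_errback(failure, url))

    def parse(self, response):
        self.logger.info('fetched %s', response.url)

    def download_errback(self, failure, url):
        # failure.request is the request that failed to download
        retries = failure.request.meta.get('errback_retries', 0)
        if retries < self.MAX_ERRBACK_RETRIES:
            retry_request = failure.request.replace(dont_filter=True)
            retry_request.meta['errback_retries'] = retries + 1
            # a request returned from the errback goes back through the scheduler
            return retry_request
        self.logger.error('giving up on %s after %d retries', url, retries)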
You can override the RETRY_HTTP_CODES setting in settings.py.
These are the settings I use for proxy errors:
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408]
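For context, a short settings.py sketch putting that list next to the related retry settings; RETRY_ENABLED and RETRY_TIMES are standard Scrapy settings, and the values shown are only illustrative starting points:

# settings.py -- illustrative values, adjust to your crawl
RETRY_ENABLED = True    # the built-in RetryMiddleware is enabled by default
RETRY_TIMES = 3         # extra attempts per failed request (Scrapy's default is 2)
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408]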