How do I catch errors with scrapy so I can do something when I get a User Timeout error?

ERROR: Error downloading : User timeout caused connection failure.

I get this issue every now and then when using my scraper, and I'd like to catch it so I can handle the failed requests.

2 Answers
  • 2020-12-13 21:48

    What you can do is define an errback in your Request instances:

    errback (callable) – a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Twisted Failure instance as first parameter.

    Here's some sample code (for scrapy 1.0) that you can use:

    # -*- coding: utf-8 -*-
    # errbacks.py
    import scrapy
    
    # on Scrapy < 1.0 the import path was:
    # from scrapy.contrib.spidermiddleware.httperror import HttpError
    from scrapy.spidermiddlewares.httperror import HttpError
    from twisted.internet.error import DNSLookupError
    from twisted.internet.error import TimeoutError
    
    
    class ErrbackSpider(scrapy.Spider):
        name = "errbacks"
        start_urls = [
            "http://www.httpbin.org/",              # HTTP 200 expected
            "http://www.httpbin.org/status/404",    # Not found error
            "http://www.httpbin.org/status/500",    # server issue
            "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
            "http://www.httphttpbinbin.org/",       # DNS error expected
        ]
    
        def start_requests(self):
            for u in self.start_urls:
                yield scrapy.Request(u, callback=self.parse_httpbin,
                                        errback=self.errback_httpbin,
                                        dont_filter=True)
    
        def parse_httpbin(self, response):
            self.logger.error('Got successful response from {}'.format(response.url))
            # do something useful now
    
        def errback_httpbin(self, failure):
            # log all errback failures,
            # in case you want to do something special for some errors,
            # you may need the failure's type
            self.logger.error(repr(failure))
    
            # failure.check(ExceptionType) is the Twisted equivalent of
            # isinstance(failure.value, ExceptionType)
            if failure.check(HttpError):
                # you can get the response
                response = failure.value.response
                self.logger.error('HttpError on %s', response.url)
    
            elif failure.check(DNSLookupError):
                # this is the original request
                request = failure.request
                self.logger.error('DNSLookupError on %s', request.url)
    
            elif failure.check(TimeoutError):
                request = failure.request
                self.logger.error('TimeoutError on %s', request.url)
    

    And the output of scrapy runspider (with only 1 retry and a 5-second download timeout):

    $ scrapy runspider errbacks.py --set DOWNLOAD_TIMEOUT=5 --set RETRY_TIMES=1
    2015-06-30 23:45:55 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
    2015-06-30 23:45:55 [scrapy] INFO: Optional features available: ssl, http11
    2015-06-30 23:45:55 [scrapy] INFO: Overridden settings: {'DOWNLOAD_TIMEOUT': '5', 'RETRY_TIMES': '1'}
    2015-06-30 23:45:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
    2015-06-30 23:45:56 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2015-06-30 23:45:56 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2015-06-30 23:45:56 [scrapy] INFO: Enabled item pipelines: 
    2015-06-30 23:45:56 [scrapy] INFO: Spider opened
    2015-06-30 23:45:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-06-30 23:45:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httphttpbinbin.org/> (failed 1 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname.
    2015-06-30 23:45:56 [scrapy] DEBUG: Gave up retrying <GET http://www.httphttpbinbin.org/> (failed 2 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname.
    2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.DNSLookupError'>>
    2015-06-30 23:45:56 [errbacks] ERROR: DNSLookupError on http://www.httphttpbinbin.org/
    2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (200) <GET http://www.httpbin.org/> (referer: None)
    2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (404) <GET http://www.httpbin.org/status/404> (referer: None)
    2015-06-30 23:45:56 [errbacks] ERROR: Got successful response from http://www.httpbin.org/
    2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
    2015-06-30 23:45:56 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/404
    2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 1 times): 500 Internal Server Error
    2015-06-30 23:45:57 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org/status/500> (failed 2 times): 500 Internal Server Error
    2015-06-30 23:45:57 [scrapy] DEBUG: Crawled (500) <GET http://www.httpbin.org/status/500> (referer: None)
    2015-06-30 23:45:57 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
    2015-06-30 23:45:57 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/500
    2015-06-30 23:46:01 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org:12345/> (failed 1 times): User timeout caused connection failure.
    2015-06-30 23:46:06 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org:12345/> (failed 2 times): User timeout caused connection failure.
    2015-06-30 23:46:06 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.TimeoutError'>>
    2015-06-30 23:46:06 [errbacks] ERROR: TimeoutError on http://www.httpbin.org:12345/
    2015-06-30 23:46:06 [scrapy] INFO: Closing spider (finished)
    2015-06-30 23:46:06 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 4,
     'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
     'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
     'downloader/request_bytes': 1748,
     'downloader/request_count': 8,
     'downloader/request_method_count/GET': 8,
     'downloader/response_bytes': 12506,
     'downloader/response_count': 4,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/404': 1,
     'downloader/response_status_count/500': 2,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 6, 30, 21, 46, 6, 537191),
     'log_count/DEBUG': 10,
     'log_count/ERROR': 9,
     'log_count/INFO': 7,
     'response_received_count': 3,
     'scheduler/dequeued': 8,
     'scheduler/dequeued/memory': 8,
     'scheduler/enqueued': 8,
     'scheduler/enqueued/memory': 8,
     'start_time': datetime.datetime(2015, 6, 30, 21, 45, 56, 322235)}
    2015-06-30 23:46:06 [scrapy] INFO: Spider closed (finished)
    

    Notice how scrapy logs the exceptions in its stats:

    'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
    'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
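
    If you want to act on these counters from inside the spider, they can be read
    back from the stats collector; here is a minimal sketch using the spider's
    closed() hook (the stat key is the one shown above):

    def closed(self, reason):
        # read how many requests ultimately failed with a timeout
        timeouts = self.crawler.stats.get_value(
            'downloader/exception_type_count/twisted.internet.error.TimeoutError', 0)
        self.logger.info('%d request(s) failed with a timeout', timeouts)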
    
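    Since the question is specifically about reacting to the user timeout, one
    further option is to re-schedule the timed-out request from the errback itself
    with a longer per-request timeout. This is only a sketch: it assumes that
    requests yielded from an errback are scheduled like callback output (recent
    Scrapy versions do this), and the 30-second timeout and the 'timeout_retried'
    meta key are arbitrary values made up for the example:

    def errback_httpbin(self, failure):
        if failure.check(TimeoutError):
            request = failure.request
            # give the request one extra chance with a more generous timeout
            if not request.meta.get('timeout_retried'):
                yield request.replace(
                    dont_filter=True,
                    meta={**request.meta,
                          'timeout_retried': True,
                          'download_timeout': 30},
                )
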
  • 2020-12-13 21:52

    I prefer to have a custom Retry Middleware like this:

    # the scrapy.contrib.* path used in older Scrapy versions is deprecated;
    # use the current module path instead
    from scrapy.downloadermiddlewares.retry import RetryMiddleware

    from fake_useragent import FakeUserAgentError


    class FakeUserAgentErrorRetryMiddleware(RetryMiddleware):

        def process_exception(self, request, exception, spider):
            # re-schedule the request (honouring RETRY_TIMES) when the
            # fake_useragent library failed to provide a user agent
            if isinstance(exception, FakeUserAgentError):
                return self._retry(request, exception, spider)
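
    For this to take effect the middleware has to be enabled in the project
    settings. A minimal sketch (the 'myproject.middlewares' module path and the
    550 priority are placeholders, adjust them to your project):

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.FakeUserAgentErrorRetryMiddleware': 550,
    }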
    