Question
I am using the following to check for (internet) connection errors in my spider.py:
# Imports assumed for the names used below (they live in the full spider.py):
from twisted.internet.error import DNSLookupError
from scrapy.spidermiddlewares.httperror import HttpError
from scrapy.exceptions import CloseSpider

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

def handle_error(self, failure):
    if failure.check(DNSLookupError):  # or failure.check(UnknownHostError):
        request = failure.request
        self.logger.error('DNSLookupError on: %s', request.url)
        print("\nDNS Error! Please check your internet connection!\n")
    elif failure.check(HttpError):
        response = failure.value.response
        self.logger.error('HttpError on: %s', response.url)
        print('\nSpider closed because of Connection issues!\n')
        raise CloseSpider('Because of Connection issues!')
...
However, when the spider runs and the connection is down, I still get Traceback (most recent call last): messages. I would like to get rid of these by handling the error and shutting down the spider properly.
The output I get is:
2018-10-11 12:52:15 [NewAds] ERROR: DNSLookupError on: https://x.com
DNS Error! Please check your internet connection!
2018-10-11 12:52:15 [scrapy.core.scraper] ERROR: Error downloading <GET https://x.com>
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3.6/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/usr/lib/python3.6/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python3.6/site-packages/twisted/internet/endpoints.py", line 954, in startConnectionAttempts
    "no results for hostname lookup: {}".format(self._hostStr)
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: x.com.
From this you can notice the following:
- I am able to partially handle the (first?) DNSLookupError error, but...
- shutting down the spider does not seem to be fast enough, so the spider continues trying to download the URL, causing a different error (ERROR: Error downloading),
- possibly causing a 2nd error: twisted.internet.error.DNSLookupError?
How can I handle the [scrapy.core.scraper] ERROR: Error downloading and make sure the spider gets shut down properly?
(Or: how can I check the internet connection on spider startup?)
Answer 1:
OK, I have been trying to play nice with Scrapy and to exit gracefully when there is no internet connection or some other error. The result? I could not get it to work properly. Instead I ended up just shutting down the entire interpreter and all of its obnoxious deferred children using os._exit(0), like this:
import os
import socket
#from scrapy.exceptions import CloseSpider
...

def check_connection(self):
    try:
        socket.create_connection(("www.google.com", 443))
        return True
    except Exception:
        pass
    return False

def start_requests(self):
    if not self.check_connection():
        print('Connection Lost! Please check your internet connection!', flush=True)
        os._exit(0)                      # Kill Everything
        #CloseSpider('Grace Me!')        # Close clean but expect deferred errors!
        #raise CloseSpider('No Grace')   # Raise Exception (w. Traceback)?!
    ...
That did it!
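For context, a rough sketch of why os._exit() behaves differently from the more usual sys.exit() inside a running crawl (the helper names here are purely illustrative):

import os
import sys

def hard_stop():
    # os._exit() terminates the interpreter immediately: no atexit hooks,
    # no finally blocks, no Twisted reactor shutdown, and therefore no
    # chance for further deferred errors to be logged.
    os._exit(0)

def soft_stop():
    # sys.exit() only raises SystemExit. Raised inside a running Twisted
    # reactor / Scrapy crawl, that exception can be caught by the deferred
    # machinery instead of stopping the process cleanly.
    sys.exit(0)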
NOTE
I tried various internal methods to shut down Scrapy and to handle the obnoxious [scrapy.core.scraper] ERROR: Error downloading issue. This only (?) happens when you use raise CloseSpider('Because of Connection issues!'), among the many other things I attempted. It is then followed by a twisted.internet.error.DNSLookupError, which seems to appear out of nowhere even though I have already handled it in my code. Obviously raise is the manual way to always raise an exception, so instead use CloseSpider() without it.
The issue at hand also seems to be a recurring one in the Scrapy framework... and in fact the source code has some FIXMEs in there. Even when I tried to apply things like:
def stop(self):
    self.deferred = defer.Deferred()
    for name, signal in vars(signals).items():
        if not name.startswith('_'):
            disconnect_all(signal)
    self.deferred.callback(None)
and using these...
#self.stop()
#sys.exit()
#disconnect_all(signal, **kwargs)
#self.crawler.engine.close_spider(spider, 'cancelled')
#scrapy.crawler.CrawlerRunner.stop()
#crawler.signals.stop()
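Of those, the engine call is the most self-contained one to wire in; a minimal sketch of how that attempt could look inside the spider, reusing the check_connection() helper above (an illustration of one of the attempts listed, not a confirmed fix):

def start_requests(self):
    if not self.check_connection():
        self.logger.error('No internet connection, asking the engine to close the spider.')
        # Ask Scrapy's engine to close this spider with a custom reason.
        # The shutdown is asynchronous, so already-scheduled requests can
        # still surface errors before the crawl actually stops.
        self.crawler.engine.close_spider(self, 'no_internet_connection')
        return
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)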
PS. It would be great if the Scrapy devs could document how best to deal with such a simple case as no internet connection.
Answer 2:
I believe I may just have found an answer. To exit out of start_requests gracefully, return []. This tells Scrapy there are no requests to process.
To close a spider, call the close() method on the spider: self.close('reason')
import logging
import scrapy
import socket


class SpiderIndex(scrapy.Spider):
    name = 'test'

    def check_connection(self):
        try:
            socket.create_connection(("www.google.com", 443))
            return True
        except Exception:
            pass
        return False

    def start_requests(self):
        if not self.check_connection():
            print('Connection Lost! Please check your internet connection!', flush=True)
            self.close(self, 'Connection Lost!')
            return []

        # Continue as normal ...
        request = scrapy.Request(url='https://www.google.com', callback=self.parse)
        yield request

    def parse(self, response):
        self.log(f'===TEST SPIDER: PARSE REQUEST======{response.url}===========', logging.INFO)
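If you want to try the spider above outside of a full Scrapy project, one way to run it is with CrawlerProcess; a small sketch, assuming the class is saved in a module named test_spider.py (the module name is just for illustration):

from scrapy.crawler import CrawlerProcess

from test_spider import SpiderIndex  # hypothetical module containing the class above

process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
process.crawl(SpiderIndex)
process.start()  # blocks until the crawl and the reactor have finished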
Addendum: For some strange reason, on one spider self.close('reason') worked, while on another I had to change it to self.close(self, 'reason').
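A likely explanation (based on my reading of Scrapy's source, so treat the exact code as an approximation): Spider.close is a static method that expects the spider instance as its first argument, roughly like this:

# Approximate shape of Spider.close in scrapy/spiders/__init__.py:
class Spider:
    @staticmethod
    def close(spider, reason):
        closed = getattr(spider, 'closed', None)
        if callable(closed):
            return closed(reason)

Because it is a staticmethod, calling self.close('reason') does not implicitly pass self, so the explicit self.close(self, 'reason') form is the one that matches the (spider, reason) signature.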
Source: https://stackoverflow.com/questions/52757819/how-to-handle-connection-or-download-error-in-scrapy