Question
My Scrapy application is outputting this long chain of exceptions, and I am failing to see what the issue is; the last one has me especially confused.
Before I explain why, here is the chain:
2020-11-04 17:38:58,394:ERROR:Error while obtaining start requests
Traceback (most recent call last):
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 1347, in getresponse
response.begin()
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 307, in begin
version, status, reason = self._read_status()
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 276, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\adapters.py", line 439, in send
resp = conn.urlopen(
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 726, in urlopen
retries = retries.increment(
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\util\retry.py", line 403, in increment
raise six.reraise(type(error), error, _stacktrace)
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\packages\six.py", line 734, in reraise
raise value.with_traceback(tb)
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 1347, in getresponse
response.begin()
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 307, in begin
version, status, reason = self._read_status()
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 276, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\shadow_useragent\core.py", line 35, in _update
r = requests.get(url=self.URL)
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
request = next(slot.start_requests)
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\scrapy_splash\middleware.py", line 167, in process_start_requests
for req in start_requests:
File "C:\Users\lguarro\Documents\Work\SearchEngine_Pure\SearchEngine_Pure\spiders\SearchEngine.py", line 36, in start_requests
user_agent = self.ua.random_nomobile
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\shadow_useragent\core.py", line 120, in random_nomobile
return self.pickrandom(exclude_mobile=True)
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\shadow_useragent\core.py", line 83, in pickrandom
self.update()
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\shadow_useragent\core.py", line 59, in update
self._update()
File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\shadow_useragent\core.py", line 38, in _update
self.logger.error(r.content.decode('utf-8'))
UnboundLocalError: local variable 'r' referenced before assignment
Now, the last exception complains about:
UnboundLocalError: local variable 'r' referenced before assignment
The only code in that trace that is mine is the SearchEngine.py file, which doesn't even have a variable 'r', leaving me very confused. Here is the implementation of start_requests from which the error occurs:
def start_requests(self):
    user_agent = self.ua.random_nomobile  # Exception raised here
    rec = self.mh.FindIdleOneWithNoURLs()
    if rec:
        self.logger.info("Starting url scrape for company, %s using user agent: %s", rec["Company"], user_agent)
        script = self.template.substitute(useragent=user_agent, searchquery=rec["Company"])
        yield SplashRequest(url=self.url, callback=self.parse, endpoint="execute",
                            args={
                                'lua_source': script
                            },
                            meta={'RecID': rec["_id"], 'Company': rec["Company"]},
                            errback=self.logerror
                            )
It is complaining about the first line of that function, in which I see no problem.
In case it is relevant, I will also add that my script seemed to run fine just yesterday, but today I had to reset my Docker configuration (which the Splash container runs in), and since then I haven't been able to run my script smoothly.
Answer 1:
I found out what was causing the issue! In fact there was no error on my part; instead, it is a bug inside the shadow-useragent library.
The library periodically makes an API request to fetch a list of the most used user agents. The server behind that API is currently down, and the authors of shadow-useragent were not handling the resulting exception properly.
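This is also what produces the confusing final UnboundLocalError. Below is a minimal, self-contained sketch of that failure pattern; the function name and URL are hypothetical, and it only mirrors the try/except shape implied by the traceback, not the library's actual source:

import requests

def fetch_user_agents(url):
    try:
        r = requests.get(url=url)
        return r.json()
    except Exception:
        # If requests.get() itself raises (e.g. the remote end closes the
        # connection), 'r' is never assigned, so this line raises
        # UnboundLocalError while handling the original ConnectionError --
        # the same chaining shown in the traceback above.
        print(r.content.decode('utf-8'))

fetch_user_agents('http://example.invalid/')  # hypothetical URL; any unreachable host will do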
Fortunately, shadow-useragent does cache the most recent list of user agents it was able to fetch. So my solution was to edit the shadow-useragent code to bypass the update function entirely and to use the cached list (inside the data.pk file) beyond its scheduled update. If anyone else runs into this issue, this is a temporary workaround you can use until that server is up and running again. Hopefully soon!
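For what it's worth, a similar bypass can be sketched without editing the installed package, by overriding update() on the instance so that pickrandom only ever reads the cached list. This is a rough sketch and assumes the package exposes the ShadowUserAgent class and that the cached data.pk contents are loaded when the object is constructed:

from shadow_useragent import ShadowUserAgent

ua = ShadowUserAgent()
ua.update = lambda: None          # no-op: skip the broken API refresh
user_agent = ua.random_nomobile   # served from the previously cached list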
Source: https://stackoverflow.com/questions/64684489/long-chain-of-exceptions-in-scrapy-splash-application