Long chain of exceptions in scrapy splash application

Submitted by 醉酒当歌 on 2021-01-29 05:36:12

Question


My scrapy application is outputting a long chain of exceptions, and I am failing to see what the issue is; the last one in particular has me confused.

Before I explain why, here is the chain:

2020-11-04 17:38:58,394:ERROR:Error while obtaining start requests
Traceback (most recent call last):
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 1347, in getresponse
    response.begin()
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 276, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\adapters.py", line 439, in send
    resp = conn.urlopen(
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\util\retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\packages\six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 1347, in getresponse
    response.begin()
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\http\client.py", line 276, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\shadow_useragent\core.py", line 35, in _update
    r = requests.get(url=self.URL)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\requests\adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
    request = next(slot.start_requests)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\scrapy_splash\middleware.py", line 167, in process_start_requests
    for req in start_requests:
  File "C:\Users\lguarro\Documents\Work\SearchEngine_Pure\SearchEngine_Pure\spiders\SearchEngine.py", line 36, in start_requests
    user_agent = self.ua.random_nomobile
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\shadow_useragent\core.py", line 120, in random_nomobile
    return self.pickrandom(exclude_mobile=True)
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\shadow_useragent\core.py", line 83, in pickrandom
    self.update()
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\shadow_useragent\core.py", line 59, in update
    self._update()
  File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\shadow_useragent\core.py", line 38, in _update
    self.logger.error(r.content.decode('utf-8'))
UnboundLocalError: local variable 'r' referenced before assignment

The last exception in the chain complains about:

UnboundLocalError: local variable 'r' referenced before assignment

The only code of mine in that trace is the SearchEngine.py file, which doesn't even have a variable 'r', leaving me very confused. Here is the implementation of start_requests where the error occurs:

def start_requests(self):
    user_agent = self.ua.random_nomobile # Exception raised here
    rec = self.mh.FindIdleOneWithNoURLs()
    if rec:
        self.logger.info("Starting url scrape for company, %s using user agent: %s", rec["Company"], user_agent)
        script = self.template.substitute(useragent=user_agent, searchquery=rec["Company"])
        yield SplashRequest(url=self.url, callback=self.parse, endpoint="execute", 
            args={
                'lua_source': script
            },
            meta={'RecID': rec["_id"], 'Company': rec["Company"]},
            errback=self.logerror
        )

It is complaining about the first line in that function for which I see no problem.

In case it is relevant, I will also add that my script seemed to run fine just yesterday, but today I had to reset my Docker configuration (which the Splash container runs in), and since then I haven't been able to run my script smoothly.


Answer 1:


I found out what was causing the issue! In fact there was no error on my part; instead, it is a bug inside the shadow-useragent library.

The library periodically makes an API request to fetch a list of the most-used user agents. The server behind this API is down, and the authors of shadow-useragent do not handle the resulting exception properly.

Fortunately, shadow-useragent does cache the most recent list of user agents it was able to fetch. So my solution was to edit the shadow-useragent code to bypass the update function entirely and keep using the cached list (inside the data.pk file) beyond its scheduled update. If anyone else runs into this issue, this is a temporary workaround you can use until that server is up and running again, hopefully soon!
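If you would rather not patch the library itself, an alternative sketch (`FALLBACK_UA` and the wrapper name are my own, not part of shadow-useragent's API) is to guard the call in the spider and fall back to a fixed user agent when the update blows up:

```python
# Hypothetical fallback; any reasonable desktop user-agent string works.
FALLBACK_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
               "AppleWebKit/537.36 (KHTML, like Gecko) "
               "Chrome/86.0.4240.111 Safari/537.36")


def safe_user_agent(ua_provider):
    """Return a random non-mobile user agent from the provider, or the
    fixed fallback if the provider's update path raises (including the
    library's own UnboundLocalError seen in the traceback above)."""
    try:
        return ua_provider.random_nomobile
    except Exception:
        return FALLBACK_UA
```

In `start_requests`, the first line would then become `user_agent = safe_user_agent(self.ua)`, so an unreachable update server degrades to a static user agent instead of killing the crawl.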



Source: https://stackoverflow.com/questions/64684489/long-chain-of-exceptions-in-scrapy-splash-application
