Question
I am working on a simple URL parser: the idea is to take a URL in one column, attempt to resolve it, and print out where it redirects to.
I have the basic functionality working; however, every so often it throws an http.client.RemoteDisconnected exception and the program stops, throwing a few errors (below).
Traceback (most recent call last):
  File "URLIFIER.py", line 43, in <module>
    row.append(urlparse(row[0]))
  File "URLIFIER.py", line 12, in urlparse
    conn = urllib.request.urlopen(urlColumnElem,timeout=8)
  File "//anaconda/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "//anaconda/lib/python3.5/urllib/request.py", line 466, in open
    response = self._open(req, data)
  File "//anaconda/lib/python3.5/urllib/request.py", line 484, in _open
    '_open', req)
  File "//anaconda/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "//anaconda/lib/python3.5/urllib/request.py", line 1282, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "//anaconda/lib/python3.5/urllib/request.py", line 1257, in do_open
    r = h.getresponse()
  File "//anaconda/lib/python3.5/http/client.py", line 1197, in getresponse
    response.begin()
  File "//anaconda/lib/python3.5/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "//anaconda/lib/python3.5/http/client.py", line 266, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
This happened after I stepped through around 4K URLs in about 40 minutes. Sometimes if I just rerun the script (same input), it goes through and completes with no issues. I've read that some websites attempt to block Python's urlopen to reduce network load, and that setting a user-agent would help. Is the lack of a user-agent being set causing this issue?
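For reference, urllib.request does let you send a User-Agent header by wrapping the URL in a Request object before calling urlopen. A minimal sketch (the URL and header value below are just example placeholders, not from the original script):

import urllib.request

# Wrap the URL in a Request so a User-Agent header is sent with it.
# The header value here is an arbitrary example.
url = "http://example.com"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
conn = urllib.request.urlopen(req, timeout=8)
print(conn.geturl())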
The function that does most of the legwork is below:
import socket
import urllib.error
import urllib.request

def urlparse(urlColumnElem):
    try:
        # default timeout is 8 seconds
        conn = urllib.request.urlopen(urlColumnElem, timeout=8)
        redirect = conn.geturl()
        # check whether we were redirected
        if redirect == urlColumnElem:
            #print("same: ")
            #print(redirect)
            return redirect
        else:
            #print("Not the same url")
            return redirect
    # catch all the exceptions
    except urllib.error.HTTPError as e:
        return e.code
    except urllib.error.URLError as e:
        return 'URL_Error'
    except socket.timeout as e:
        return "timeout"
Answer 1:
Solved: it is actually very simple. Add a handler for http.client.HTTPException (in Python 2 it would be httplib.HTTPException), i.e.:
import http.client
import socket
import urllib.error
import urllib.request

def urlparse(urlColumnElem):
    try:
        # default timeout is 8 seconds
        conn = urllib.request.urlopen(urlColumnElem, timeout=8)
        redirect = conn.geturl()
        # check whether we were redirected
        if redirect == urlColumnElem:
            #print("same: ")
            #print(redirect)
            return redirect
        else:
            #print("Not the same url")
            return redirect
    # catch all the exceptions
    except urllib.error.HTTPError as e:
        return e.code
    except urllib.error.URLError as e:
        return 'URL_Error'
    except socket.timeout as e:
        return "timeout"
    except http.client.HTTPException as e:
        return "HTTPException"
Source: https://stackoverflow.com/questions/43676939/http-client-remotedisconnected-error-while-reading-parsing-a-list-of-urls