http.client.RemoteDisconnected error while reading/parsing a list of URLs

Posted by 谁都会走 on 2019-12-25 07:46:04

Question


I am working on a simple URL parser: the idea is to take a URL from one column, attempt to resolve it, and print out where it redirects to.

I have the basic functionality working, however every so often it throws an http.client.RemoteDisconnected exception and the program stops, printing a few errors (below).

Traceback (most recent call last):
  File "URLIFIER.py", line 43, in <module>
    row.append(urlparse(row[0]))
  File "URLIFIER.py", line 12, in urlparse
    conn = urllib.request.urlopen(urlColumnElem,timeout=8)
  File "//anaconda/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "//anaconda/lib/python3.5/urllib/request.py", line 466, in open
    response = self._open(req, data)
  File "//anaconda/lib/python3.5/urllib/request.py", line 484, in _open
    '_open', req)
  File "//anaconda/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "//anaconda/lib/python3.5/urllib/request.py", line 1282, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "//anaconda/lib/python3.5/urllib/request.py", line 1257, in do_open
    r = h.getresponse()
  File "//anaconda/lib/python3.5/http/client.py", line 1197, in getresponse
    response.begin()
  File "//anaconda/lib/python3.5/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "//anaconda/lib/python3.5/http/client.py", line 266, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

This happened after I stepped through around 4K URLs in about 40 minutes. Sometimes if I just rerun the script (same input), it goes through and completes with no issues. I've read that some websites attempt to block Python's urlopen to reduce network load, and that setting a user-agent would help. Is the lack of a user-agent being set causing this issue?
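(For anyone who wants to test the user-agent theory: a minimal sketch of attaching an explicit User-Agent header via urllib.request.Request. The helper names and the header string are arbitrary examples, not code from the question.)

```python
import urllib.request

def build_request(url):
    # Attach a browser-like User-Agent; the exact string is an arbitrary example.
    return urllib.request.Request(
        url,
        headers={"User-Agent": "Mozilla/5.0 (compatible; url-checker/1.0)"},
    )

def urlopen_with_agent(url, timeout=8):
    # Same urlopen call as in the question's function, but with the header attached.
    return urllib.request.urlopen(build_request(url), timeout=timeout)
```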

The function that does most of the legwork is below:

import socket
import urllib.error
import urllib.request

def urlparse(urlColumnElem):
    try:
        # default timeout is 8 seconds
        conn = urllib.request.urlopen(urlColumnElem, timeout=8)
        redirect = conn.geturl()
        # check redirect
        if redirect == urlColumnElem:
            # print("same:", redirect)
            return redirect
        else:
            # print("Not the same url")
            return redirect
    # catch all the exceptions
    except urllib.error.HTTPError as e:
        return e.code
    except urllib.error.URLError:
        return 'URL_Error'
    except socket.timeout:
        return "timeout"

Answer 1:


Solved: it is actually very simple: add a handler for

http.client.HTTPException

(in Python 2 it would be httplib.HTTPException), i.e.

import http.client
import socket
import urllib.error
import urllib.request

def urlparse(urlColumnElem):
    try:
        # default timeout is 8 seconds
        conn = urllib.request.urlopen(urlColumnElem, timeout=8)
        redirect = conn.geturl()
        # check redirect
        if redirect == urlColumnElem:
            # print("same:", redirect)
            return redirect
        else:
            # print("Not the same url")
            return redirect
    # catch all the exceptions
    except urllib.error.HTTPError as e:
        return e.code
    except urllib.error.URLError:
        return 'URL_Error'
    except socket.timeout:
        return "timeout"
    except http.client.HTTPException:
        return "HTTPException"
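This works because RemoteDisconnected subclasses http.client.BadStatusLine, which in turn subclasses http.client.HTTPException, so the new handler catches it. Since the asker notes that rerunning the same input often succeeds, one could also retry transient failures a few times before giving up. A minimal sketch (not part of the accepted answer; the function name and sentinel return value are made up for illustration):

```python
import http.client
import socket
import urllib.error
import urllib.request

def resolve_with_retries(url, attempts=3, timeout=8):
    # Retry transient failures; note URLError also covers HTTPError (4xx/5xx),
    # so a real implementation may want to handle those separately.
    for attempt in range(attempts):
        try:
            return urllib.request.urlopen(url, timeout=timeout).geturl()
        except (http.client.HTTPException, socket.timeout, urllib.error.URLError):
            if attempt == attempts - 1:
                return "failed_after_retries"
```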


Source: https://stackoverflow.com/questions/43676939/http-client-remotedisconnected-error-while-reading-parsing-a-list-of-urls
