Read timeout using either urllib2 or any other http library

Backend · Open · 8 answers · 824 views
花落未央 · 2020-11-29 05:36

I have code for reading an url like this:

from urllib2 import Request, urlopen
req = Request(url)
for key, val in headers.items():
    req.add_header(key, val)
content = urlopen(req).read()

How can I set a timeout that applies to the read as well, not just to the initial connection?
8 Answers
  • 2020-11-29 05:40

    This isn't the behavior I see. I get a URLError when the call times out:

    from urllib2 import Request, urlopen
    req = Request('http://www.google.com')
    res = urlopen(req,timeout=0.000001)
    #  Traceback (most recent call last):
    #  File "<stdin>", line 1, in <module>
    #  ...
    #  raise URLError(err)
    #  urllib2.URLError: <urlopen error timed out>
    

    Can't you catch this error and then avoid trying to read res? When I try to use res.read() after this I get NameError: name 'res' is not defined. Is something like this what you need:

    from urllib2 import URLError

    try:
        res = urlopen(req, timeout=3.0)
    except URLError:
        print 'Doh!'
    else:
        # only read the response when urlopen succeeded; otherwise res
        # was never bound and res.read() raises NameError
        print 'yay!'
        print res.read()
    

    I suppose the way to implement a timeout manually is via multiprocessing, no? If the job hasn't finished you can terminate it.
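    That multiprocessing idea can be sketched as follows (a Python 3 sketch, with a hypothetical `slow_job` standing in for the blocking network fetch):

```python
import multiprocessing
import time

def slow_job(queue):
    # stands in for a blocking network read that never finishes in time
    time.sleep(10)
    queue.put('data')

def run_with_timeout(target, timeout):
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=target, args=(queue,))
    proc.start()
    proc.join(timeout)       # wait at most `timeout` seconds
    if proc.is_alive():
        proc.terminate()     # hard-kill the worker on timeout
        proc.join()
        return None
    return queue.get()

result = run_with_timeout(slow_job, 0.5)
print(result)  # None: the job was killed after 0.5 s
```

    The cost of this approach is a process per request plus the terminate-and-reap bookkeeping, which is why socket-level timeouts are usually preferred when they suffice.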

  • 2020-11-29 05:46

    I found in my tests (using the technique described here) that a timeout set in the urlopen() call also affects the read() call:

    import urllib2 as u
    c = u.urlopen('http://localhost/', timeout=5.0)
    s = c.read(1<<20)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.7/socket.py", line 380, in read
        data = self._sock.recv(left)
      File "/usr/lib/python2.7/httplib.py", line 561, in read
        s = self.fp.read(amt)
      File "/usr/lib/python2.7/httplib.py", line 1298, in read
        return s + self._file.read(amt - len(s))
      File "/usr/lib/python2.7/socket.py", line 380, in read
        data = self._sock.recv(left)
    socket.timeout: timed out
    

    Maybe it's a feature of newer versions? I'm using Python 2.7 on Ubuntu 12.04, straight out of the box.
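    This behavior follows from the timeout being set on the underlying socket, so every recv() honors it. A minimal Python 3 sketch (plain sockets and a local stalling server, no urllib or network access needed) showing that the read itself times out:

```python
import socket
import threading

def stalling_server(ports, ready):
    srv = socket.socket()
    srv.bind(('127.0.0.1', 0))        # pick any free port
    srv.listen(1)
    ports.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()            # accept, then never send a byte
    threading.Event().wait(5)
    conn.close()
    srv.close()

ports, ready = [], threading.Event()
threading.Thread(target=stalling_server, args=(ports, ready), daemon=True).start()
ready.wait()

cli = socket.create_connection(('127.0.0.1', ports[0]), timeout=0.5)
try:
    cli.recv(1)                       # blocks until the 0.5 s timeout fires
    timed_out = False
except socket.timeout:
    timed_out = True
cli.close()
print(timed_out)  # True: the read, not just the connect, timed out
```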

  • 2020-11-29 05:55

    One possible (imperfect) solution is to set the global socket timeout, explained in more detail here:

    import socket
    import urllib2
    
    # timeout in seconds
    socket.setdefaulttimeout(10)
    
    # this call to urllib2.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)
    

    However, this only works if you're willing to globally modify the timeout for all users of the socket module. I'm running the request from within a Celery task, so doing this would mess up timeouts for the Celery worker code itself.

    I'd be happy to hear any other solutions...
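    The global effect described above is easy to see: every socket created after setdefaulttimeout() inherits the value (a small Python 3 sketch):

```python
import socket

socket.setdefaulttimeout(2.5)
s = socket.socket()
inherited = s.gettimeout()      # every new socket picks up the default
s.close()
socket.setdefaulttimeout(None)  # restore, to avoid surprising other code

print(inherited)  # 2.5
```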

  • 2020-11-29 05:56

    I had the same issue with a socket timeout on the read statement. What worked for me was putting both the urlopen and the read inside a single try statement. Hope this helps!
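    A sketch of that pattern in modern Python 3, where urllib.request replaces urllib2 (the URL below is a hypothetical placeholder that intentionally fails to resolve):

```python
import socket
import urllib.request
from urllib.error import URLError

def fetch(url, timeout=5.0):
    try:
        # the timeout covers the connect and every subsequent read
        response = urllib.request.urlopen(url, timeout=timeout)
        return response.read()
    except (URLError, socket.timeout):
        # a timeout during read() surfaces as socket.timeout,
        # one during connect as URLError
        return None

print(fetch('http://host.invalid/', timeout=1.0))  # None: lookup fails
```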

  • 2020-11-29 05:59

    I'd expect this to be a common problem, and yet no answers were to be found anywhere. I just built a solution for this using a timeout signal:

    import urllib2
    import socket
    
    timeout = 10
    socket.setdefaulttimeout(timeout)
    
    import time
    import signal
    
    def timeout_catcher(signum, _):
        raise urllib2.URLError("Read timeout")
    
    signal.signal(signal.SIGALRM, timeout_catcher)
    
    def safe_read(url, timeout_time):
        signal.setitimer(signal.ITIMER_REAL, timeout_time)
        try:
            content = urllib2.urlopen(url, timeout=timeout_time).read()
        finally:
            # always disarm the timer, even when urlopen raises, so a
            # stray SIGALRM can't fire later
            signal.setitimer(signal.ITIMER_REAL, 0)
        return content
    

    Credit for the signal part of the solution goes to: python timer mystery
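    The same SIGALRM mechanism, stripped to its essentials in Python 3 (Unix-only, since SIGALRM does not exist on Windows), with time.sleep() standing in for the blocking read:

```python
import signal
import time

class ReadTimeout(Exception):
    pass

def timeout_catcher(signum, frame):
    raise ReadTimeout('read timed out')

signal.signal(signal.SIGALRM, timeout_catcher)
signal.setitimer(signal.ITIMER_REAL, 0.2)    # fire SIGALRM in 0.2 s
try:
    time.sleep(5)                            # stands in for a blocking read
    timed_out = False
except ReadTimeout:
    timed_out = True
finally:
    signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
print(timed_out)  # True
```

    Note that signals are delivered to the main thread only, so this trick does not transplant into worker threads.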

  • 2020-11-29 06:00

    It's not possible for any library to do this without using some kind of asynchronous timer through threads or otherwise. The reason is that the timeout parameter used in httplib, urllib2 and other libraries sets the timeout on the underlying socket. And what this actually does is explained in the documentation.

    SO_RCVTIMEO

    Sets the timeout value that specifies the maximum amount of time an input function waits until it completes. It accepts a timeval structure with the number of seconds and microseconds specifying the limit on how long to wait for an input operation to complete. If a receive operation has blocked for this much time without receiving additional data, it shall return with a partial count or errno set to [EAGAIN] or [EWOULDBLOCK] if no data is received.

    The key part is the last sentence: a socket.timeout is raised only if not a single byte is received for the duration of the timeout window. In other words, this is a timeout between received bytes.
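    This between-bytes behavior can be demonstrated with plain Python 3 sockets: a local server that drips one byte every 0.2 s never trips a 0.5 s timeout, even though the whole transfer takes over a second:

```python
import socket
import threading
import time

def drip_server(ports, ready):
    srv = socket.socket()
    srv.bind(('127.0.0.1', 0))
    srv.listen(1)
    ports.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()
    for _ in range(6):
        conn.sendall(b'x')      # one byte at a time...
        time.sleep(0.2)         # ...with 0.2 s gaps between them
    conn.close()
    srv.close()

ports, ready = [], threading.Event()
threading.Thread(target=drip_server, args=(ports, ready), daemon=True).start()
ready.wait()

cli = socket.create_connection(('127.0.0.1', ports[0]), timeout=0.5)
data = b''
while len(data) < 6:
    data += cli.recv(1)         # each wait is ~0.2 s, under the 0.5 s limit
cli.close()
print(data)  # b'xxxxxx' -- no timeout, despite >1 s total transfer time
```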

    A simple function using threading.Timer could be as follows.

    import httplib
    import socket
    import threading
    
    def download(host, path, timeout = 10):
        content = None
        
        http = httplib.HTTPConnection(host)
        http.request('GET', path)
        response = http.getresponse()
        
        timer = threading.Timer(timeout, http.sock.shutdown, [socket.SHUT_RD])
        timer.start()
        
        try:
            content = response.read()
        except httplib.IncompleteRead:
            pass
            
        timer.cancel() # cancel on triggered Timer is safe
        http.close()
        
        return content
    
    >>> host = 'releases.ubuntu.com'
    >>> content = download(host, '/15.04/ubuntu-15.04-desktop-amd64.iso', 1)
    >>> print content is None
    True
    >>> content = download(host, '/15.04/MD5SUMS', 1)
    >>> print content is None
    False
    

    Other than checking for None, it's also possible to catch the httplib.IncompleteRead exception outside the function rather than inside it. The latter won't work, though, if the HTTP response doesn't carry a Content-Length header.
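    The shutdown trick at the core of this answer (forcing a blocked read to return by shutting down the socket's read side from a Timer thread) can be isolated in Python 3. It relies on POSIX shutdown() semantics waking a blocked recv(), so behavior may vary off Linux:

```python
import socket
import threading
import time

def silent_server(ports, ready):
    srv = socket.socket()
    srv.bind(('127.0.0.1', 0))
    srv.listen(1)
    ports.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()
    time.sleep(5)               # stall: never send the payload
    conn.close()
    srv.close()

ports, ready = [], threading.Event()
threading.Thread(target=silent_server, args=(ports, ready), daemon=True).start()
ready.wait()

cli = socket.create_connection(('127.0.0.1', ports[0]))
# after 0.5 s, force the pending recv() to return by closing the read side
timer = threading.Timer(0.5, cli.shutdown, [socket.SHUT_RD])
timer.start()
data = cli.recv(1024)           # unblocks with b'' once the timer fires
timer.cancel()                  # cancelling an already-fired Timer is safe
cli.close()
print(data == b'')  # True: the read was cut short by the deadline
```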
