Python: Get HTTP headers from urllib2.urlopen call?

猫巷女王i 2020-11-27 11:22

Does urllib2 fetch the whole page when a urlopen call is made?

I'd like to just read the HTTP response header without getting the page.

6 Answers
  • 2020-11-27 11:28

    One-liner:

    $ python -c "import urllib2; print urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1)).open(urllib2.Request('http://google.com'))"
    
  • 2020-11-27 11:31
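    def _GetHtmlPage(self, addr):
      # Request headers to send along with the GET.
      headers = { 'User-Agent' : self.userAgent,
                  'Cookie' : self.cookies}
    
      # Pass the headers to the request; otherwise they are never sent.
      req = urllib2.Request(addr, headers=headers)
      response = urllib2.urlopen(req)
    
      # response.info() holds the HTTP response headers.
      print "ResponseInfo="
      print response.info()
    
      resultsHtml = unicode(response.read(), self.encoding)
      return resultsHtml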
    
  • 2020-11-27 11:34

    urllib2.urlopen does an HTTP GET (or POST if you supply a data argument), not an HTTP HEAD (if it did the latter, you couldn't do readlines or other accesses to the page body, of course).

  • 2020-11-27 11:35

    What about sending a HEAD request instead of a normal GET request? The following snippet (copied from a similar question) does exactly that.

    >>> import httplib
    >>> conn = httplib.HTTPConnection("www.google.com")
    >>> conn.request("HEAD", "/index.html")
    >>> res = conn.getresponse()
    >>> print res.status, res.reason
    200 OK
    >>> print res.getheaders()
    [('content-length', '0'), ('expires', '-1'), ('server', 'gws'), ('cache-control', 'private, max-age=0'), ('date', 'Sat, 20 Sep 2008 06:43:36 GMT'), ('content-type', 'text/html; charset=ISO-8859-1')]
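
    On Python 3 the httplib module is named http.client. A minimal sketch of the same HEAD request there, assuming Python 3; head_request and its parameters are illustrative names, not part of the original answer:

```python
import http.client


def head_request(host, path="/", port=80, timeout=10):
    """Send a HEAD request and return (status, reason, headers).

    head_request is a hypothetical helper for illustration only.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        res = conn.getresponse()
        # getheaders() returns a list of (name, value) tuples.
        return res.status, res.reason, res.getheaders()
    finally:
        conn.close()
```

    Calling head_request("www.google.com", "/index.html") would correspond to the interactive session above.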
    
  • 2020-11-27 11:41

    Use the response.info() method to get the headers.

    From the urllib2 docs:

    urllib2.urlopen(url[, data][, timeout])

    ...

    This function returns a file-like object with two additional methods:

    • geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed
    • info() — return the meta-information of the page, such as headers, in the form of an httplib.HTTPMessage instance (see Quick Reference to HTTP Headers)

    So, for your example, try stepping through the result of response.info().headers for what you're looking for.

    Note the major caveat to using httplib.HTTPMessage is documented in python issue 4773.
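
    For readers on Python 3, where urllib2 became urllib.request, a rough sketch of the same idea follows. It assumes Python 3; get_response_headers is an illustrative name. Note that this still issues a GET, it just never calls read() on the body:

```python
import urllib.request


def get_response_headers(url, timeout=10):
    """Open url with a normal GET and return its headers as (name, value) pairs.

    Hypothetical helper for illustration; the body is never read here.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # info() returns the message object holding the response headers.
        return list(response.info().items())
```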

  • Actually, it appears that urllib2 can do an HTTP HEAD request.

    The question that @reto linked to, above, shows how to get urllib2 to do a HEAD request.

    Here's my take on it:

    import urllib2
    
    # Derive from Request class and override get_method to allow a HEAD request.
    class HeadRequest(urllib2.Request):
        def get_method(self):
            return "HEAD"
    
    myurl = 'http://bit.ly/doFeT'
    request = HeadRequest(myurl)
    
    try:
        response = urllib2.urlopen(request)
        response_headers = response.info()
    
        # Print all the header key-value pairs.  Replace this line with
        # something useful.
        print (response_headers.dict)
    
    except urllib2.HTTPError, e:
        # Prints the HTTP Status code of the response but only if there was a 
        # problem.
        print ("Error code: %s" % e.code)
    

    If you check this with something like the Wireshark network protocol analyzer, you can see that it actually sends a HEAD request rather than a GET.

    This is the HTTP request and response from the code above, as captured by Wireshark:

    HEAD /doFeT HTTP/1.1
    Accept-Encoding: identity
    Host: bit.ly
    Connection: close
    User-Agent: Python-urllib/2.7

    HTTP/1.1 301 Moved
    Server: nginx
    Date: Sun, 19 Feb 2012 13:20:56 GMT
    Content-Type: text/html; charset=utf-8
    Cache-control: private; max-age=90
    Location: http://www.kidsidebyside.org/?p=445
    MIME-Version: 1.0
    Content-Length: 127
    Connection: close
    Set-Cookie: _bit=4f40f738-00153-02ed0-421cf10a;domain=.bit.ly;expires=Fri Aug 17 13:20:56 2012;path=/; HttpOnly

    However, as mentioned in one of the comments in the other question, if the URL in question includes a redirect then urllib2 will do a GET request to the destination, not a HEAD. This could be a major shortcoming, if you really wanted to only make HEAD requests.

    The request above involves a redirect. Here is the request to the destination, as captured by Wireshark:

    GET /2009/05/come-and-draw-the-circle-of-unity-with-us/ HTTP/1.1
    Accept-Encoding: identity
    Host: www.kidsidebyside.org
    Connection: close
    User-Agent: Python-urllib/2.7
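
    One way to keep HEAD across the redirect using only the standard library is a custom redirect handler. The sketch below assumes Python 3, where urllib2 lives on as urllib.request and Request accepts a method argument; HeadRedirectHandler and head_following_redirects are hypothetical names, and this is not the approach from the linked question:

```python
import urllib.request


class HeadRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Keep using HEAD when a HEAD request is redirected (hypothetical name)."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        new_req = super().redirect_request(req, fp, code, msg, headers, newurl)
        if new_req is not None and req.get_method() == "HEAD":
            # Force HEAD on the follow-up request as well.
            new_req.method = "HEAD"
        return new_req


def head_following_redirects(url, timeout=10):
    """Return (final_url, headers), using HEAD for every hop."""
    # Passing the subclass replaces the default redirect handler.
    opener = urllib.request.build_opener(HeadRedirectHandler)
    request = urllib.request.Request(url, method="HEAD")
    with opener.open(request, timeout=timeout) as response:
        return response.geturl(), list(response.info().items())
```

    A packet capture of head_following_redirects("http://bit.ly/doFeT") should then show HEAD for both hops, assuming the redirect still exists.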

    An alternative to using urllib2 is to use Joe Gregorio's httplib2 library:

    import httplib2
    
    url = "http://bit.ly/doFeT"
    http_interface = httplib2.Http()
    
    try:
        response, content = http_interface.request(url, method="HEAD")
        print ("Response status: %d - %s" % (response.status, response.reason))
    
        # Print all the dictionary key-value pairs.  Replace this line with
        # something useful.
        print (response.__dict__)
    
    except httplib2.ServerNotFoundError, e:
        print (e.message)
    

    This has the advantage of using HEAD requests for both the initial HTTP request and the redirected request to the destination URL.

    Here's the first request:

    HEAD /doFeT HTTP/1.1
    Host: bit.ly
    accept-encoding: gzip, deflate
    user-agent: Python-httplib2/0.7.2 (gzip)

    Here's the second request, to the destination:

    HEAD /2009/05/come-and-draw-the-circle-of-unity-with-us/ HTTP/1.1
    Host: www.kidsidebyside.org
    accept-encoding: gzip, deflate
    user-agent: Python-httplib2/0.7.2 (gzip)
