How to check redirected web page address, without downloading it in Python

Submitted by 梦想与她 on 2019-12-11 06:49:21

Question


For a given URL, how can I detect the final location after HTTP redirects without downloading the final page (e.g. with a HEAD request), using Python? I am writing a mass downloader, and its download mechanism needs to know the final location of a page before downloading it.

Edit

I ended up doing this; I hope it helps other people. I am still open to other methods.

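import http.client  # named httplib in Python 2
from urllib.parse import urljoin, urlparse  # the urlparse module in Python 2

def getFinalUrl(url, max_redirects=10):
    "Follows redirects with HEAD requests and returns the final URL."
    for _ in range(max_redirects):
        parsed = urlparse(url)
        # Use a TLS connection for https URLs.
        if parsed.scheme == "https":
            conn = http.client.HTTPSConnection(parsed.netloc)
        else:
            conn = http.client.HTTPConnection(parsed.netloc)
        # Keep the query string; an empty path must be sent as "/".
        target = (parsed.path or "/") + ("?" + parsed.query if parsed.query else "")
        conn.request("HEAD", target)
        response = conn.getresponse()
        location = response.getheader("Location")  # case-insensitive lookup
        conn.close()
        if not (300 <= response.status < 400 and location):
            return url
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return url  # gave up after max_redirects hops; last URL seen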

Answer 1:


You can use httplib (renamed http.client in Python 3) to send HEAD requests.




Answer 2:


You can also look at python-requests, which seems to be the trendy new API for HTTP requests, replacing the possibly awkward httplib2 (see "Why Not httplib2").

It also has a head() method for this.
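As a minimal sketch of that approach: requests.head() does not follow redirects unless allow_redirects=True is passed, and the resulting Response.url then holds the last URL reached. The function name final_url and its session and timeout parameters are illustrative assumptions, not part of the original answer.

```python
import requests


def final_url(url, session=None, timeout=10):
    """Return the URL reached after following redirects, using HEAD requests.

    `session` is an optional requests.Session (hypothetical convenience
    parameter); the module-level requests API is used when it is omitted.
    """
    http = session or requests
    # HEAD requests are not redirected by default in requests,
    # so redirect following has to be switched on explicitly.
    response = http.head(url, allow_redirects=True, timeout=timeout)
    # response.history holds the intermediate redirect responses;
    # response.url is the address of the last response in the chain.
    return response.url
```

Calling final_url("http://example.com/some/page") would then return the address the downloader should fetch. Some servers answer HEAD differently from GET (or reject it), so the result is best treated as a hint.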




Answer 3:


I strongly suggest using the requests library. It is well written and actively maintained, and it covers what you need here, including deferring the download of the response body.

From the Requests documentation (http://docs.python-requests.org/en/latest/user/advanced/):

By default, when you make a request, the body of the response is downloaded immediately. You can override this behavior and defer downloading the response body until you access the Response.content attribute with the stream parameter:

import requests

tarball_url = 'https://github.com/kennethreitz/requests/tarball/master'
# Requests 1.0 renamed this option; older releases spelled it prefetch=False.
r = requests.get(tarball_url, stream=True)

At this point only the response headers have been downloaded and the connection remains open, hence allowing us to make content retrieval conditional:

if int(r.headers['content-length']) < TOO_LONG:
  content = r.content
  ...

You can further control the workflow by use of the Response.iter_content and Response.iter_lines methods, or by reading from the underlying urllib3 urllib3.HTTPResponse at Response.raw.
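Applied to the original question, the deferred download can be sketched as follows: a streamed GET follows redirects by default, and the final address can be read from Response.url before the final body is fetched. The name final_url_streaming and the session and timeout parameters are assumptions made for illustration.

```python
import requests


def final_url_streaming(url, session=None, timeout=10):
    """Follow redirects with GET, but stop before reading the final body.

    `session` is an optional requests.Session (hypothetical convenience
    parameter); the module-level requests API is used when it is omitted.
    """
    http = session or requests
    # With stream=True the body of the final response is not read
    # until .content, .iter_content() or .raw is accessed.
    response = http.get(url, stream=True, timeout=timeout)
    try:
        return response.url
    finally:
        # Release the connection without downloading the body.
        response.close()
```

This variant may help with servers that handle HEAD poorly, at the cost of issuing a real GET whose body is then discarded.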



Source: https://stackoverflow.com/questions/7484473/how-to-check-redirected-web-page-address-without-downloading-it-in-python
