How can I unshorten a URL?

时光说笑 2020-11-30 05:19

I want to be able to take a shortened or non-shortened URL and return its un-shortened form. How can I write a Python program to do this?

Additional Clarification:

9 Answers
  • 2020-11-30 05:26

    http://github.com/stef/urlclean

    sudo pip install urlclean
    urlclean.unshorten(url)
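
    A minimal usage sketch, assuming the package exposes a module importable as urlclean (the short link is borrowed from another answer in this thread, purely as an example):

    import urlclean  # assumes the module name matches the package

    print(urlclean.unshorten('http://bit.ly/cXEInp'))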
    
  • 2020-11-30 05:27

    Here is some source code (Python 2) that takes into account almost all of the useful corner cases:

    • set a custom timeout.
    • set a custom User-Agent.
    • check whether we have to use an HTTP or HTTPS connection.
    • resolve the input URL recursively and prevent ending up in a redirect loop.

    The src code is on github @ https://github.com/amirkrifa/UnShortenUrl

    comments are welcome ...

    import logging
    logging.basicConfig(level=logging.DEBUG)
    
    TIMEOUT = 10
    class UnShortenUrl:
        def process(self, url, previous_url=None):
            logging.info('Init url: %s'%url)
            import urlparse
            import httplib
            try:
                parsed = urlparse.urlparse(url)
                if parsed.scheme == 'https':
                    h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
                else:
                    h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
                resource = parsed.path
                if parsed.query != "": 
                    resource += "?" + parsed.query
                try:
                    # A HEAD request fetches only the headers, not the body.
                    h.request('HEAD',
                              resource,
                              headers={'User-Agent': 'curl/7.38.0'})
                    response = h.getresponse()
                except:
                    import traceback
                    traceback.print_exc()
                    return url
    
                logging.info('Response status: %d'%response.status)
                # A 3xx status with a Location header means the URL redirects.
                if response.status/100 == 3 and response.getheader('Location'):
                    red_url = response.getheader('Location')
                    logging.info('Red, previous: %s, %s'%(red_url, previous_url))
                    # Stop if the redirect points back to the previous URL (loop).
                    if red_url == previous_url:
                        return red_url
                    return self.process(red_url, previous_url=url) 
                else:
                    return url 
            except:
                import traceback
                traceback.print_exc()
                return None
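
    A minimal usage sketch of the class above (the short link is borrowed from another answer in this thread, purely as an example):

    unshortener = UnShortenUrl()
    print(unshortener.process('http://bit.ly/cXEInp'))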
    
  • 2020-11-30 05:32

    Send an HTTP HEAD request to the URL and look at the response code. If the code is 30x, look at the Location header to get the unshortened URL. Otherwise, if the code is 20x, then the URL is not redirected; you probably also want to handle error codes (4xx and 5xx) in some fashion. For example:

    # This is for Py2k.  For Py3k, use http.client and urllib.parse instead, and
    # use // instead of / for the division
    import httplib
    import urlparse
    
    def unshorten_url(url):
        parsed = urlparse.urlparse(url)
        h = httplib.HTTPConnection(parsed.netloc)
        h.request('HEAD', parsed.path)
        response = h.getresponse()
        if response.status/100 == 3 and response.getheader('Location'):
            return response.getheader('Location')
        else:
            return url
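
    Following the comment at the top of that snippet, here is a rough Python 3 equivalent using http.client and urllib.parse (the HTTPS branch, the query-string handling, and the default '/' path are small additions in this sketch, not part of the answer above):

    import http.client
    import urllib.parse

    def unshorten_url(url):
        parsed = urllib.parse.urlparse(url)
        # Pick the connection class that matches the scheme.
        if parsed.scheme == 'https':
            h = http.client.HTTPSConnection(parsed.netloc)
        else:
            h = http.client.HTTPConnection(parsed.netloc)
        path = parsed.path or '/'
        if parsed.query:
            path += '?' + parsed.query
        h.request('HEAD', path)
        response = h.getresponse()
        # A 3xx status with a Location header means the URL redirects.
        if response.status // 100 == 3 and response.getheader('Location'):
            return response.getheader('Location')
        return url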
    
  • 2020-11-30 05:33

    Open the URL and see what it resolves to:

    >>> import urllib2
    >>> a = urllib2.urlopen('http://bit.ly/cXEInp')
    >>> print a.url
    http://www.flickr.com/photos/26432908@N00/346615997/sizes/l/
    >>> a = urllib2.urlopen('http://google.com')
    >>> print a.url
    http://www.google.com/
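
    In Python 3, urllib2 was folded into urllib.request; the same idea looks like this (a sketch, assuming the bit.ly link above still resolves the same way):

    >>> import urllib.request
    >>> a = urllib.request.urlopen('http://bit.ly/cXEInp')
    >>> print(a.url)
    http://www.flickr.com/photos/26432908@N00/346615997/sizes/l/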
    
  • 2020-11-30 05:40

    Using requests:

    import requests

    url = 'http://bit.ly/cXEInp'  # any shortened (or regular) URL
    session = requests.Session()  # so connections are recycled
    resp = session.head(url, allow_redirects=True)  # follow the whole redirect chain
    print(resp.url)  # the final, unshortened URL
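
    Using HEAD here means only headers are downloaded at each hop, and allow_redirects=True makes requests follow the whole redirect chain, so resp.url ends up being the final, unshortened URL.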
    
  • 2020-11-30 05:40

    To unshorten a URL, you can use requests. This is a simple solution that works for me.

    import requests

    url = "http://foo.com"

    site = requests.get(url)  # requests follows redirects by default
    print(site.url)           # the final URL after all redirects
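
    Note that requests.get downloads the whole response body; if you only need the final URL, a HEAD request with allow_redirects=True (as in the previous answer) is lighter.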
    