How can I un-shorten a URL using python?

孤城傲影 2020-12-08 12:22

I have seen this thread already - How can I unshorten a URL?

My issue with the resolved answer (that is using the unshort.me API) is that I am focusing on unshorteni

4 answers
  • 2020-12-08 12:45

    You DO have to open it, otherwise you won't know what URL it will redirect to. As Greg put it:

    A short link is a key into somebody else's database; you can't expand the link without querying the database

    Now to your question.

    Does anyone know of a more efficient way to complete this operation without using open (since it is a waste of bandwidth)?

    The more efficient way is to not close the connection: keep it open in the background by using HTTP's Connection: keep-alive.

    In a small test, unshorten.me seems to honor the HEAD method and answers with a redirect to itself:

    > telnet unshorten.me 80
    Trying 64.202.189.170...
    Connected to unshorten.me.
    Escape character is '^]'.
    HEAD http://unshort.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp HTTP/1.1
    Host: unshorten.me
    
    HTTP/1.1 301 Moved Permanently
    Date: Mon, 22 Aug 2011 20:42:46 GMT
    Server: Microsoft-IIS/6.0
    X-Powered-By: ASP.NET
    X-AspNet-Version: 2.0.50727
    Location: http://resolves.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp
    Cache-Control: private
    Content-Length: 0
    

    So if you use the HEAD HTTP method, instead of GET, you will actually end up doing the same work twice.

    Instead, you should keep the connection alive, which will save you only a little bandwidth, but what it will certainly save is the latency of establishing a new connection every time. Establishing a TCP/IP connection is expensive.

    You should get away with a number of kept-alive connections to the unshorten service equal to the number of concurrent connections your own service receives.

    You could manage these connections in a pool. That's the closest you can get, short of tweaking your kernel's TCP/IP stack.
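
One way to sketch the pooling idea in Python 3, using only the standard library. The class name `KeepAlivePool` and its methods are hypothetical, invented for illustration. It keeps one `http.client` connection per host and reuses it for successive HEAD requests, so the TCP handshake is paid once per host rather than once per lookup. `http.client` speaks HTTP/1.1, where connections stay open unless the server asks to close them.

```python
import http.client
from urllib.parse import urlsplit


class KeepAlivePool:
    """Hypothetical helper: one reusable connection per (scheme, host)."""

    def __init__(self, timeout=10):
        self.timeout = timeout
        self._connections = {}

    def connection_for(self, url):
        """Return the cached connection for url's host, creating it lazily."""
        parts = urlsplit(url)
        key = (parts.scheme, parts.netloc)
        if key not in self._connections:
            factory = (http.client.HTTPSConnection if parts.scheme == "https"
                       else http.client.HTTPConnection)
            # No socket is opened here; http.client connects on first request.
            self._connections[key] = factory(parts.netloc, timeout=self.timeout)
        return self._connections[key]

    def head_location(self, url):
        """Send HEAD over the kept-alive connection; return Location or None."""
        parts = urlsplit(url)
        resource = parts.path or "/"
        if parts.query:
            resource += "?" + parts.query
        conn = self.connection_for(url)
        try:
            conn.request("HEAD", resource)
            response = conn.getresponse()
            response.read()  # finish the response so the socket can be reused
        except (http.client.HTTPException, OSError):
            conn.close()  # drop the broken socket; the next request reconnects
            raise
        return response.getheader("Location")

    def close(self):
        """Close every pooled connection."""
        for conn in self._connections.values():
            conn.close()
        self._connections.clear()
```

This sketch is not thread-safe: a service handling concurrent requests would need one pool per worker, or a lock around each connection.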

  • 2020-12-08 12:52

    Here is source code that takes into account most of the useful corner cases:

    • set a custom Timeout.
    • set a custom User Agent.
    • check whether we have to use an http or https connection.
    • resolve the input URL recursively and avoid getting stuck in a redirect loop.

    The source code is on GitHub at https://github.com/amirkrifa/UnShortenUrl

    Comments are welcome.

    import http.client
    import logging
    import traceback
    from urllib.parse import urljoin, urlparse

    logging.basicConfig(level=logging.DEBUG)

    TIMEOUT = 10  # seconds


    class UnShortenUrl:
        def process(self, url, previous_url=None):
            """Follow HEAD redirects from url; return the final URL, or None on error."""
            logging.info('Init url: %s', url)
            try:
                parsed = urlparse(url)
                # Pick the connection class that matches the URL scheme.
                if parsed.scheme == 'https':
                    h = http.client.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
                else:
                    h = http.client.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
                resource = parsed.path or '/'
                if parsed.query != '':
                    resource += '?' + parsed.query
                try:
                    h.request('HEAD', resource,
                              headers={'User-Agent': 'curl/7.38.0'})
                    response = h.getresponse()
                except Exception:
                    # Network failure: log it and fall back to the last URL reached.
                    traceback.print_exc()
                    return url
                logging.info('Response status: %d', response.status)
                location = response.getheader('Location')
                if response.status // 100 == 3 and location:
                    red_url = urljoin(url, location)  # Location may be relative
                    logging.info('Red, previous: %s, %s', red_url, previous_url)
                    # Stop when the redirect points back at the previous URL.
                    if red_url == previous_url:
                        return red_url
                    return self.process(red_url, previous_url=url)
                return url
            except Exception:
                traceback.print_exc()
                return None
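
The loop check above only compares a redirect target with the immediately preceding URL, so a cycle of three or more URLs would keep recursing until Python's recursion limit is hit. A minimal sketch of a stricter guard follows. It assumes a hypothetical `fetch_location(url)` callable that performs one HEAD request and returns the `Location` header, or `None` when there is no redirect; the names `resolve` and `max_hops` are likewise illustrative and not part of the repository linked above.

```python
from urllib.parse import urljoin


def resolve(url, fetch_location, max_hops=10):
    """Follow redirects iteratively; stop on any repeated URL or after max_hops."""
    seen = {url}
    for _ in range(max_hops):
        location = fetch_location(url)
        if not location:
            return url  # no further redirect: this is the final URL
        next_url = urljoin(url, location)  # Location may be relative
        if next_url in seen:
            return next_url  # cycle detected, give up here
        seen.add(next_url)
        url = next_url
    return url  # hop limit reached
```

Passing the network call in as a parameter also makes the loop logic testable against a plain dictionary, for example `resolve("http://a/", table.get)`.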
    
  • 2020-12-08 13:01

    A one-line function using the requests library. With allow_redirects=True it follows the whole chain of redirects:

    import requests

    def unshorten_url(url):
        return requests.head(url, allow_redirects=True).url
    
  • 2020-12-08 13:08

    Use the best rated answer (not the accepted answer) in that question:

    # This is for Py2k.  For Py3k, use http.client and urllib.parse instead, and
    # use // instead of / for the division
    import httplib
    import urlparse
    
    def unshorten_url(url):
        parsed = urlparse.urlparse(url)
        h = httplib.HTTPConnection(parsed.netloc)
        resource = parsed.path
        if parsed.query != "":
            resource += "?" + parsed.query
        h.request('HEAD', resource )
        response = h.getresponse()
        if response.status/100 == 3 and response.getheader('Location'):
            return unshorten_url(response.getheader('Location')) # changed to process chains of short urls
        else:
            return url
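
Following the comment at the top of that snippet, a Python 3 port might look like the sketch below. It is not the original author's code: beyond the module renames and `//`, it adds a timeout, falls back to `/` for an empty path, and closes each connection, all of which are additions not in the original. Like the original it only speaks plain HTTP, so an `https://` short link would need `http.client.HTTPSConnection` instead.

```python
import http.client
from urllib.parse import urlparse


def unshorten_url(url):
    """Follow HEAD redirects until a non-3xx response; return the last URL."""
    parsed = urlparse(url)
    h = http.client.HTTPConnection(parsed.netloc, timeout=10)  # timeout is an addition
    resource = parsed.path or "/"  # an empty path is not a valid request target
    if parsed.query != "":
        resource += "?" + parsed.query
    try:
        h.request("HEAD", resource)
        response = h.getresponse()
        status = response.status
        location = response.getheader("Location")
    finally:
        h.close()
    if status // 100 == 3 and location:
        return unshorten_url(location)  # process chains of short URLs
    return url
```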
    