How can I un-shorten a URL using python?


孤城傲影

I have seen this thread already - How can I unshorten a URL?

My issue with the resolved answer (that is using the unshort.me API) is that I am focusing on unshorteni

4 answers

我寻月下人不归

2020-12-08 12:45
You DO have to open it, otherwise you won't know what URL it will redirect to. As Greg put it:

A short link is a key into somebody else's database; you can't expand the link without querying the database

Now to your question.

Does anyone know of a more efficient way to complete this operation without using open (since it is a waste of bandwidth)?

The more efficient way is not to close the connection but to keep it open in the background, using HTTP's Connection: keep-alive.

After a small test, unshorten.me seems to take the HEAD method into account and answers with a redirect to itself:
```
> telnet unshorten.me 80
Trying 64.202.189.170...
Connected to unshorten.me.
Escape character is '^]'.
HEAD http://unshort.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp HTTP/1.1
Host: unshorten.me

HTTP/1.1 301 Moved Permanently
Date: Mon, 22 Aug 2011 20:42:46 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Location: http://resolves.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp
Cache-Control: private
Content-Length: 0
```
So if you use the HEAD HTTP method, instead of GET, you will actually end up doing the same work twice.

Instead, you should keep the connection alive. That saves only a little bandwidth, but it certainly saves the latency of establishing a new connection every time. Establishing a TCP/IP connection is expensive.

You should get away with a number of kept-alive connections to the unshorten service equal to the number of concurrent connections your own service receives.

You could manage these connections in a pool. That's the closest you can get, short of tweaking your kernel's TCP/IP stack.
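As a rough illustration of the pooling idea, here is a minimal Python 3 sketch built on the standard library's `http.client`. The class name `KeepAlivePool`, its `size` and `timeout` parameters, and the `head_location` method are made up for this example, and it assumes the service is reachable over plain HTTP on a single host.

```python
import http.client
import queue


class KeepAlivePool:
    """A tiny fixed-size pool of persistent HTTP connections to one host."""

    def __init__(self, host, size=4, timeout=10):
        self._host = host
        self._timeout = timeout
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(self._new_connection())

    def _new_connection(self):
        # http.client speaks HTTP/1.1, so connections are persistent by
        # default; the socket is opened lazily on the first request.
        return http.client.HTTPConnection(self._host, timeout=self._timeout)

    def head_location(self, resource):
        """Send HEAD for `resource`; return (status, Location header or None)."""
        conn = self._idle.get()  # blocks until a connection is free
        try:
            conn.request("HEAD", resource, headers={"Connection": "keep-alive"})
            response = conn.getresponse()
            response.read()  # finish the response so the socket can be reused
            return response.status, response.getheader("Location")
        except (http.client.HTTPException, OSError):
            # The server may have dropped an idle connection: replace it
            # in the pool and let the caller decide whether to retry.
            conn.close()
            conn = self._new_connection()
            raise
        finally:
            self._idle.put(conn)
```

A caller would create one pool per service host, sized to its expected concurrency, and call `pool.head_location("/index.php?r=...")` from each worker; successive requests then reuse an already-open TCP connection instead of paying for a new handshake.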

灰色年华

2020-12-08 12:52

Here is source code that takes most of the useful corner cases into account:

Set a custom timeout.
Set a custom User-Agent.
Check whether to use an HTTP or HTTPS connection.
Resolve the input URL recursively and avoid getting stuck in a redirect loop.

The source code is on GitHub: https://github.com/amirkrifa/UnShortenUrl

Comments are welcome.

```
# Python 2 code (httplib, urlparse). On Python 3, use http.client and urllib.parse.
import logging
import traceback
import httplib
import urlparse

logging.basicConfig(level=logging.DEBUG)

TIMEOUT = 10  # seconds


class UnShortenUrl:
    def process(self, url, previous_url=None):
        logging.info('Init url: %s' % url)
        try:
            parsed = urlparse.urlparse(url)
            # Pick the connection class that matches the URL scheme.
            if parsed.scheme == 'https':
                h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
            else:
                h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
            resource = parsed.path
            if parsed.query != "":
                resource += "?" + parsed.query
            try:
                # HEAD fetches only the headers; send a custom User-Agent.
                h.request('HEAD', resource, headers={'User-Agent': 'curl/7.38.0'})
                response = h.getresponse()
            except Exception:
                traceback.print_exc()  # was print_exec(), which does not exist
                return url
            logging.info('Response status: %d' % response.status)
            if response.status // 100 == 3 and response.getheader('Location'):
                red_url = response.getheader('Location')
                logging.info('Red, previous: %s, %s' % (red_url, previous_url))
                # Stop if the redirect points back to the URL we came from.
                if red_url == previous_url:
                    return red_url
                return self.process(red_url, previous_url=url)
            else:
                return url
        except Exception:
            traceback.print_exc()
            return None
```
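Note that the loop check above compares only against the immediately preceding URL, so it catches A -> B -> A but not longer cycles. One possible way to tighten this, sketched here in Python 3, is to remember every URL already visited and cap the number of hops. The names `resolve`, `head` and `max_hops` are invented for this illustration; `head` stands for any caller-supplied function that performs the HEAD request and returns the Location header (or None).

```python
from urllib.parse import urljoin


def resolve(url, head, max_hops=10):
    """Follow redirects from `url` until a non-redirect, a loop, or max_hops.

    `head` is a caller-supplied function: head(url) -> Location header or None.
    """
    seen = {url}
    for _ in range(max_hops):
        location = head(url)
        if not location:
            return url  # no further redirect: this is the final URL
        next_url = urljoin(url, location)  # Location may be relative
        if next_url in seen:
            return next_url  # redirect loop: stop instead of going around again
        seen.add(next_url)
        url = next_url
    return url  # gave up after max_hops redirects
```

Keeping the network call behind `head` makes the loop logic easy to exercise with a plain dictionary, for example `resolve("http://s/a", {"http://s/a": "http://s/b"}.get)`.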


难免孤独

2020-12-08 13:01
A one-line function using the requests library. And yes, it follows chained redirects.
```
import requests

def unshorten_url(url):
    return requests.head(url, allow_redirects=True).url
```

感动是毒

2020-12-08 13:08

Use the best rated answer (not the accepted answer) in that question:

```
# This is for Py2k.  For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse

def unshorten_url(url):
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    resource = parsed.path
    if parsed.query != "":
        resource += "?" + parsed.query
    h.request('HEAD', resource )
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return unshorten_url(response.getheader('Location')) # changed to process chains of short urls
    else:
        return url
```
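For reference, here is a sketch of the Py3k port that the comment describes, using `http.client`, `urllib.parse` and `//`. Three details go beyond a literal port and are assumptions of this sketch: it picks an HTTPS connection for `https` URLs, falls back to `/` for an empty path, and sets a 10-second timeout. Like the original, it expects an absolute URL in the Location header and does not guard against redirect loops.

```python
import http.client
from urllib.parse import urlparse


def unshorten_url(url):
    parsed = urlparse(url)
    # Assumption beyond the original: use HTTPS when the URL asks for it.
    if parsed.scheme == "https":
        conn = http.client.HTTPSConnection(parsed.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parsed.netloc, timeout=10)
    resource = parsed.path or "/"
    if parsed.query:
        resource += "?" + parsed.query
    try:
        conn.request("HEAD", resource)
        response = conn.getresponse()
        location = response.getheader("Location")
    finally:
        conn.close()
    if response.status // 100 == 3 and location:
        return unshorten_url(location)  # follow chains of short URLs
    return url
```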
