I have seen this thread already - How can I unshorten a URL?
My issue with the resolved answer (that is using the unshort.me API) is that I am focusing on unshorteni
You DO have to open it, otherwise you won't know what URL it will redirect to. As Greg put it:
A short link is a key into somebody else's database; you can't expand the link without querying the database
Now to your question.
Does anyone know of a more efficient way to complete this operation without using open (since it is a waste of bandwidth)?
The more efficient way is to not close the connection: keep it open in the background by using HTTP's Connection: keep-alive.
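To illustrate the keep-alive idea, here is a minimal sketch that sends several HEAD requests over one persistent connection using Python 3's http.client. The function name head_locations and its parameters are my own, not part of any unshortening service, and the host you pass in is up to you.

```python
import http.client


def head_locations(host, paths, timeout=10):
    """Send one HEAD request per path over a single persistent connection.

    Returns a list of (status, Location header or None) tuples.
    """
    conn = http.client.HTTPConnection(host, timeout=timeout)
    results = []
    try:
        for path in paths:
            # HTTP/1.1 connections are persistent by default; the header
            # just makes the intent explicit.
            conn.request("HEAD", path, headers={"Connection": "keep-alive"})
            response = conn.getresponse()
            response.read()  # finish the (empty) HEAD response before reusing the socket
            results.append((response.status, response.getheader("Location")))
    finally:
        conn.close()
    return results
```

If the server answers with Connection: close, http.client simply reconnects on the next request, so the loop still works; you just lose the latency benefit.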
After a small test, unshorten.me seems to take the HEAD method into account and answers it with a redirect to itself:
> telnet unshorten.me 80
Trying 64.202.189.170...
Connected to unshorten.me.
Escape character is '^]'.
HEAD http://unshort.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp HTTP/1.1
Host: unshorten.me
HTTP/1.1 301 Moved Permanently
Date: Mon, 22 Aug 2011 20:42:46 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Location: http://resolves.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp
Cache-Control: private
Content-Length: 0
So if you use the HEAD HTTP method instead of GET, you will actually end up doing the same work twice.
Instead, you should keep the connection alive, which will save you only a little bandwidth, but what it will certainly save is the latency of establishing a new connection every time. Establishing a TCP/IP connection is expensive.
You should get away with a number of kept-alive connections to the unshorten service equal to the number of concurrent connections your own service receives.
You could manage these connections in a pool. That's the closest you can get, besides tweaking your kernel's TCP/IP stack.
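A pool like that could be sketched as follows. This is only an illustration under my own assumptions: the ConnectionPool class, its size parameter and the factory argument are hypothetical names, and the pool talks to a single host.

```python
import http.client
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool of persistent connections to a single host."""

    def __init__(self, host, size, timeout=10,
                 factory=http.client.HTTPConnection):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # http.client connects lazily, on the first request.
            self._idle.put(factory(host, timeout=timeout))

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks while every connection is in use
        try:
            yield conn
        except Exception:
            # Drop a possibly broken socket; http.client reopens it
            # on the next request.
            conn.close()
            raise
        finally:
            self._idle.put(conn)

    def close(self):
        while not self._idle.empty():
            self._idle.get_nowait().close()
```

You would size the pool to the number of concurrent requests your own service handles, then wrap each lookup in `with pool.connection() as conn:` and call `conn.request("HEAD", ...)` on it.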
The source code is on GitHub at https://github.com/amirkrifa/UnShortenUrl
Comments are welcome.
Here is source code that takes into account most of the useful corner cases:
# Written for Python 2; on Python 3 use http.client and urllib.parse instead.
import logging
import traceback
import httplib
import urlparse

logging.basicConfig(level=logging.DEBUG)

TIMEOUT = 10


class UnShortenUrl:
    def process(self, url, previous_url=None):
        logging.info('Init url: %s' % url)
        try:
            parsed = urlparse.urlparse(url)
            if parsed.scheme == 'https':
                h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
            else:
                h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
            resource = parsed.path
            if parsed.query != "":
                resource += "?" + parsed.query
            try:
                h.request('HEAD',
                          resource,
                          headers={'User-Agent': 'curl/7.38.0'})
                response = h.getresponse()
            except Exception:
                # Network failure: return the URL reached so far.
                traceback.print_exc()
                return url
            logging.info('Response status: %d' % response.status)
            if response.status // 100 == 3 and response.getheader('Location'):
                red_url = response.getheader('Location')
                logging.info('Red, previous: %s, %s' % (red_url, previous_url))
                # Stop when the redirect bounces back to the URL we came from.
                if red_url == previous_url:
                    return red_url
                return self.process(red_url, previous_url=url)
            else:
                return url
        except Exception:
            traceback.print_exc()
            return None
A one-line function using the requests library; and yes, it follows chains of redirects.
import requests

def unshorten_url(url):
    return requests.head(url, allow_redirects=True).url
Use the best rated answer (not the accepted answer) in that question:
# This is for Py2k. For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse

def unshorten_url(url):
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    resource = parsed.path
    if parsed.query != "":
        resource += "?" + parsed.query
    h.request('HEAD', resource)
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return unshorten_url(response.getheader('Location')) # changed to process chains of short urls
    else:
        return url
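For Python 3, the function above might be ported roughly like this. It is a sketch, not a drop-in replacement: the max_hops and timeout parameters, the HTTPS branch and the handling of relative Location headers are my additions, not part of the original answer.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def unshorten_url(url, max_hops=10, timeout=10):
    """Follow HEAD redirects starting at url and return the last URL reached."""
    for _ in range(max_hops):
        parts = urlsplit(url)
        if parts.scheme == "https":
            conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
        else:
            conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
        resource = parts.path or "/"
        if parts.query:
            resource += "?" + parts.query
        try:
            conn.request("HEAD", resource)
            response = conn.getresponse()
            status = response.status
            location = response.getheader("Location")
        finally:
            conn.close()
        if status // 100 != 3 or not location:
            return url
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    # Gave up after max_hops redirects (for example, a redirect loop).
    return url
```

The loop replaces the recursion so that a redirect cycle cannot run forever; with the default of 10 hops it simply returns whatever URL it had reached.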