I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of the link. How do I go about this in python?
______Using Python 3.3 and not 2.x________
I would like to add a small thing to Ben Blank's answer.
from urllib.parse import quote,unquote,urlparse
u=unquote(u) #u= URL e.g. http://twitter.co.uk/hello/there
g=urlparse(u)
u=g.netloc
By now, I just got the domain name from urlparse.
To remove the subdomains you first of all need to know which are Top Level Domains and which are not. E.g. in the above http://twitter.co.uk
- co.uk
is a TLD while in http://sub.twitter.com
we have only .com
as TLD and sub
is a subdomain.
So, we need to get a file/list which has all the tlds.
tlds = load_file("tlds.txt") #tlds holds the list of tlds
hostname = u.split(".")
if len(hostname)>2:
if hostname[-2].upper() in tlds:
hostname=".".join(hostname[-3:])
else:
hostname=".".join(hostname[-2:])
else:
hostname=".".join(hostname[-2:])