Get Root Domain of Link

半阙折子戏 2021-01-17 08:13

I have a link such as http://www.techcrunch.com/ and I would like to extract just the techcrunch.com part. How do I do this in Python?

7 answers
  •  执笔经年
    2021-01-17 08:49

    Using Python 3.3, not 2.x

    I would like to add a small thing to Ben Blank's answer.

    from urllib.parse import unquote, urlparse

    u = "http://twitter.co.uk/hello/there"  # example URL
    u = unquote(u)  # decode any percent-encoded characters
    g = urlparse(u)
    u = g.netloc  # network location, here "twitter.co.uk"
    

    At this point, u holds only the host part of the URL (the netloc returned by urlparse).
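
    One caveat worth illustrating: netloc keeps any credentials and port number that appear in the URL, while the hostname attribute of the same parse result returns only the lower-cased host. The URL below is a made-up example, not one from the question.

```python
from urllib.parse import urlparse

# Hypothetical URL with credentials and a port, for illustration only.
parts = urlparse("http://user:secret@www.example.com:8080/path")

netloc = parts.netloc      # still contains "user:secret@" and ":8080"
hostname = parts.hostname  # host only, lower-cased

print(netloc)
print(hostname)
```

    If the links may carry a port or login details, splitting hostname rather than netloc avoids ending up with a label such as "com:8080".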

    To remove the subdomains, you first need to know which suffixes count as top-level domains and which do not. For example, in http://twitter.co.uk, co.uk is treated as the TLD, while in http://sub.twitter.com only .com is the TLD and sub is a subdomain.

    So we need a file or list that contains all the TLDs.

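    def load_file(path):
        # Assumed helper (not defined in the original answer):
        # reads one TLD per line and returns them upper-cased
        with open(path) as f:
            return {line.strip().upper() for line in f if line.strip()}

    tlds = load_file("tlds.txt")  # tlds holds the set of TLDs

    hostname = u.split(".")
    # If the second-to-last label is itself a TLD (e.g. "co" in co.uk),
    # keep the last three labels; otherwise keep the last two.
    if len(hostname) > 2 and hostname[-2].upper() in tlds:
        hostname = ".".join(hostname[-3:])
    else:
        hostname = ".".join(hostname[-2:])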
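
    The steps above can be wrapped into one function. The sketch below is a variant, not the exact method from this answer: instead of testing a single label, it matches whole suffixes such as "co.uk" and tries the longest candidate first. The names root_domain and SAMPLE_SUFFIXES are hypothetical, and the suffix set is a tiny hand-written sample; a real list would be much longer.

```python
from urllib.parse import urlparse

# Tiny illustrative sample (an assumption); a real suffix list is far longer.
SAMPLE_SUFFIXES = {"com", "org", "net", "uk", "co.uk", "org.uk"}

def root_domain(url, suffixes=SAMPLE_SUFFIXES):
    """Return the suffix plus one label, or the bare host if no suffix matches."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # Longest candidate suffix first, so "co.uk" is found before "uk".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[i - 1:])
    return host

print(root_domain("http://www.techcrunch.com/"))
print(root_domain("http://twitter.co.uk/hello/there"))
print(root_domain("http://sub.twitter.com"))
```

    Matching full suffixes avoids a weakness of the single-label check: a name such as "google" can itself be a TLD, so testing only the second-to-last label may keep one label too many.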
    
