Get Root Domain of Link

半阙折子戏 2021-01-17 08:13

I have a link such as http://www.techcrunch.com/ and I would like to extract just the techcrunch.com part of it. How do I do this in Python?

7 answers

执笔经年 (original poster)

2021-01-17 08:49
Using Python 3.3, not 2.x

I would like to add a small thing to Ben Blank's answer.
```
from urllib.parse import unquote, urlparse

u = "http://twitter.co.uk/hello/there"  # example URL
u = unquote(u)   # decode any percent-escapes first
g = urlparse(u)  # split the URL into its components
u = g.netloc     # "twitter.co.uk"
```
At this point, u holds only the host name (the netloc) that urlparse extracted.
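
One caveat worth illustrating: netloc can also carry credentials and a port, and urlparse only fills it in when the URL has a scheme or a leading //. The sketch below uses the hostname attribute instead, which strips those extras and lower-cases the result. The helper name host_of and the sample URLs are made up for illustration.

```python
from urllib.parse import urlparse

def host_of(url):
    """Return the lower-cased host of url without port or credentials, or None."""
    return urlparse(url).hostname

# netloc keeps everything between "//" and the path
print(urlparse("http://user:pw@WWW.TechCrunch.com:8080/path").netloc)
# hostname keeps only the host part
print(host_of("http://user:pw@WWW.TechCrunch.com:8080/path"))
# Without a scheme or "//" the host ends up in the path, so this prints None
print(host_of("www.techcrunch.com/"))
```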

To remove the subdomains, you first need to know which suffixes count as top-level domains and which do not. For example, in http://twitter.co.uk the suffix co.uk acts as the TLD (strictly speaking it is a second-level public suffix under uk), while in http://sub.twitter.com only .com is the TLD and sub is a subdomain.
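
To see why this matters, here is a small sketch of the naive approach that simply keeps the last two labels of the host name. The function name last_two_labels is only illustrative.

```python
def last_two_labels(host):
    """Naive approach: keep only the last two dot-separated labels."""
    return ".".join(host.split(".")[-2:])

print(last_two_labels("sub.twitter.com"))  # twitter.com
print(last_two_labels("twitter.co.uk"))    # co.uk, which drops the actual name
```

The second call shows the failure: without knowing that co.uk is a suffix, the registered name twitter is lost.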

So we need a file or list that contains all the TLDs.

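```
# Assumption: tlds.txt lists one TLD per line in upper case (e.g. COM, UK, CO)
with open("tlds.txt") as f:
    tlds = {line.strip().upper() for line in f if line.strip()}

hostname = u.split(".")
# Heuristic: keep three labels when the second-to-last one is itself a listed TLD (co in co.uk)
if len(hostname) > 2 and hostname[-2].upper() in tlds:
    hostname = ".".join(hostname[-3:])
else:
    hostname = ".".join(hostname[-2:])
```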
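
The check above only looks at the second-to-last label, so a host such as sub.info.com would keep three labels because info is itself a TLD. A possible variation, sketched below, matches whole suffixes instead and tries the longest candidate first. This is not the code from the answer: the function root_domain and the tiny SUFFIXES set are hypothetical stand-ins, and a real list (for example the Public Suffix List, which third-party packages such as tldextract wrap) has thousands of entries.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for a real suffix list; entries are lower case.
SUFFIXES = {"com", "org", "net", "info", "uk", "co.uk", "org.uk"}

def root_domain(url, suffixes=SUFFIXES):
    """Return the suffix plus one label to its left, or None if nothing matches."""
    host = urlparse(url).hostname
    if not host:
        return None
    labels = host.split(".")
    # Walk from the longest candidate suffix to the shortest so "co.uk" wins over "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # i == 0 means the host is itself a bare suffix with no name in front.
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None

print(root_domain("http://www.techcrunch.com/"))        # techcrunch.com
print(root_domain("http://twitter.co.uk/hello/there"))  # twitter.co.uk
print(root_domain("http://sub.info.com/"))              # info.com
```

Matching whole suffixes avoids the false positive on labels that merely look like a TLD, at the cost of needing a list that includes the multi-part entries.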