Get Root Domain of Link

半阙折子戏 2021-01-17 08:13

I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of it. How do I do this in Python?

7 answers
  •  慢半拍i (original poster)
    2021-01-17 08:56

    Getting the hostname is easy enough using urlparse:

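    from urllib.parse import urlparse  # Python 3; in Python 2 this was the urlparse module

    hostname = urlparse("http://www.techcrunch.com/").hostname  # "www.techcrunch.com"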
    

    Getting the "root domain", however, is going to be more problematic, because it isn't defined in a syntactic sense. What's the root domain of "www.theregister.co.uk"? How about networks using default domains? "devbox12" could be a valid hostname.
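
    To see why a purely syntactic rule falls short, here is a minimal sketch of the obvious heuristic: keep the last two labels of the hostname. The function name `naive_root_domain` is made up for illustration and is not part of any library.

```python
from urllib.parse import urlparse


def naive_root_domain(url):
    """Keep the last two dot-separated labels of the hostname (a flawed heuristic)."""
    hostname = urlparse(url).hostname or ""
    return ".".join(hostname.split(".")[-2:])


print(naive_root_domain("http://www.techcrunch.com/"))     # techcrunch.com
print(naive_root_domain("http://www.theregister.co.uk/"))  # co.uk (not theregister.co.uk)
print(naive_root_domain("http://devbox12/"))               # devbox12
```

    The first result is what you want, the second is wrong, and the third has no dot to split on at all, so counting labels cannot be the whole answer.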

    One way to handle this would be to use the Public Suffix List, which attempts to catalogue both real top level domains (e.g. ".com", ".net", ".org") and longer suffixes that are used like TLDs because names are registered directly beneath them (e.g. ".co.uk" or even the privately operated ".github.io"). You can access the PSL from Python using the publicsuffix2 library:

    from urllib.parse import urlparse

    import publicsuffix2

    def get_base_domain(url):
        # fetch() causes an HTTP request; if your script is running more than,
        # say, once a day, you'd want to cache it yourself.  Make sure you
        # update frequently, though!
        psl = publicsuffix2.fetch()

        hostname = urlparse(url).hostname

        # Despite its name, this returns the suffix plus one more label.
        return publicsuffix2.get_public_suffix(hostname, psl)
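
    The comment in get_base_domain suggests caching the list yourself. One possible way to do that, using only the standard library, is sketched below. The names `download_psl`, `cached_psl_path`, `CACHE_PATH` and `MAX_AGE_SECONDS` are all hypothetical, the once-a-day refresh is an assumed policy, and the URL is the list's published location on publicsuffix.org as far as I know, so verify it before relying on it.

```python
import os
import time
import urllib.request

# Assumed download location of the list; check publicsuffix.org.
PSL_URL = "https://publicsuffix.org/list/public_suffix_list.dat"
CACHE_PATH = "public_suffix_list.dat"  # hypothetical local cache file
MAX_AGE_SECONDS = 24 * 60 * 60         # assumed policy: refresh once a day


def download_psl(url=PSL_URL):
    """Download the raw list and return it as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def cached_psl_path(path=CACHE_PATH, max_age=MAX_AGE_SECONDS, download=download_psl):
    """Return the path of a local copy, refreshing it when missing or stale."""
    is_stale = (
        not os.path.exists(path)
        or time.time() - os.path.getmtime(path) > max_age
    )
    if is_stale:
        data = download()
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as handle:
            handle.write(data)
        os.replace(tmp_path, path)  # swap the fresh copy in as a single step
    return path
```

    The returned path can then be opened and the file object handed to get_public_suffix in place of the result of fetch(), assuming the library accepts any file-like object there as it does for the fetched list.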
    
