Get Root Domain of Link

半阙折子戏 2021-01-17 08:13

I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of the link. How do I go about this in python?

7 Answers
  • 2021-01-17 08:49

    Using Python 3.3, not 2.x:

    I would like to add a small thing to Ben Blank's answer.

    from urllib.parse import unquote, urlparse

    u = "http://twitter.co.uk/hello/there"  # u = URL
    u = unquote(u)   # decode any percent-escapes
    g = urlparse(u)
    u = g.netloc     # "twitter.co.uk"
    

    At this point, u holds just the host name taken from urlparse.

    To remove the subdomains, you first need to know which suffixes act as top-level domains and which do not. For example, in http://twitter.co.uk above, co.uk acts as the TLD, while in http://sub.twitter.com only .com is the TLD and sub is a subdomain.

    So we need a file or list that holds all the TLDs.

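    def load_file(path):
        # Hypothetical helper: reads one label per line, stored uppercase
        with open(path) as f:
            return {line.strip().upper() for line in f if line.strip()}

    tlds = load_file("tlds.txt")  # tlds holds the set of TLD labels

    labels = u.split(".")
    # If the second-to-last label is itself in the list (e.g. "co" in
    # twitter.co.uk), keep three labels; otherwise keep the last two.
    if len(labels) > 2 and labels[-2].upper() in tlds:
        hostname = ".".join(labels[-3:])
    else:
        hostname = ".".join(labels[-2:])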
    
  • 2021-01-17 08:55
    from urllib.parse import urlsplit

    def get_domain(url):
        u = urlsplit(url)
        return u.netloc
    
    def get_top_domain(url):
        u"""
        >>> get_top_domain('http://www.google.com')
        'google.com'
        >>> get_top_domain('http://www.sina.com.cn')
        'sina.com.cn'
        >>> get_top_domain('http://bbc.co.uk')
        'bbc.co.uk'
        >>> get_top_domain('http://mail.cs.buaa.edu.cn')
        'buaa.edu.cn'
        """
        domain = get_domain(url)
        domain_parts = domain.split('.')
        if len(domain_parts) < 2:
            return domain
        top_domain_parts = 2
        # a two-letter last part is treated as a country code
        if len(domain_parts[-1]) == 2:
            if domain_parts[-1] in ['uk', 'jp']:
                if domain_parts[-2] in ['co', 'ac', 'me', 'gov', 'org', 'net']:
                    top_domain_parts = 3
            else:
                if domain_parts[-2] in ['com', 'org', 'net', 'edu', 'gov']:
                    top_domain_parts = 3
        return '.'.join(domain_parts[-top_domain_parts:])
    
  • 2021-01-17 08:56

    Getting the hostname is easy enough using urlparse (urllib.parse in Python 3):

    from urllib.parse import urlparse
    hostname = urlparse("http://www.techcrunch.com/").hostname
    

    Getting the "root domain", however, is going to be more problematic, because it isn't defined in a syntactic sense. What's the root domain of "www.theregister.co.uk"? How about networks using default domains? "devbox12" could be a valid hostname.
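    To make the purely syntactic part concrete, here is a small sketch using only the standard library (Python 3's urllib.parse). The helper name host_of and the port number are made up for illustration. Note that .hostname drops the port and lower-cases the host, and that a bare name without a scheme has no network location at all.

```python
from urllib.parse import urlparse

def host_of(url):
    """Return the lower-cased host of a URL, or None if it has no network location."""
    return urlparse(url).hostname

print(host_of("http://www.techcrunch.com/"))          # www.techcrunch.com
print(host_of("http://WWW.TheRegister.co.uk:8080/"))  # www.theregister.co.uk
print(host_of("http://devbox12/status"))              # devbox12
print(host_of("devbox12"))                            # None (parsed as a path)
```

    Nothing in these results says where the "root domain" starts; that needs outside knowledge about suffixes.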

    One way to handle this is to use the Public Suffix List, which attempts to catalogue both real top-level domains (e.g. ".com", ".net", ".org") and domains that are used like TLDs (e.g. ".co.uk" or even ".github.io"). You can access the PSL from Python using the publicsuffix2 library:

    from urllib.parse import urlparse
    
    import publicsuffix2
    
    def get_base_domain(url):
        # This causes an HTTP request; if your script is running more than,
        # say, once a day, you'd want to cache it yourself.  Make sure you
        # update frequently, though!
        psl = publicsuffix2.fetch()
    
        hostname = urlparse(url).hostname
    
        return publicsuffix2.get_public_suffix(hostname, psl)
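
    The kind of lookup a Public Suffix List library performs can be sketched without any network access. This is an illustration only, not how publicsuffix2 is implemented: SAMPLE_SUFFIXES is a tiny hand-picked subset assumed for the example, registrable_domain is a made-up name, and the wildcard and exception rules of the real list are ignored.

```python
from urllib.parse import urlparse

# Assumption: a tiny hand-picked subset of suffixes, for illustration only.
SAMPLE_SUFFIXES = {"com", "org", "net", "io", "uk", "co.uk", "github.io"}

def registrable_domain(url, suffixes=SAMPLE_SUFFIXES):
    """Return the longest matching suffix plus one label, or None if there is none."""
    hostname = urlparse(url).hostname
    if not hostname:
        return None
    labels = hostname.split(".")
    # Walk from the longest candidate suffix to the shortest, so "co.uk"
    # is tried before "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # i == 0 means the host is itself a suffix: no label is left over.
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None

print(registrable_domain("http://www.theregister.co.uk/"))  # theregister.co.uk
print(registrable_domain("http://www.techcrunch.com/"))     # techcrunch.com
print(registrable_domain("http://devbox12/"))               # None
```

    Trying the longest suffix first is what keeps www.theregister.co.uk from collapsing to co.uk.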
    
  • 2021-01-17 08:59

    The following script is not perfect, but it is good enough for display or shortening purposes. If you really need to avoid third-party dependencies, especially ones that fetch and cache TLD data remotely, I can suggest this script, which I use in my projects. It keeps the last two parts of the domain for the most common domain extensions and the last three parts for all other, less common extensions. In the worst case, the result has three parts instead of two:

    from urllib.parse import urlparse  # Python 3; in Python 2: from urlparse import urlparse
    
    def extract_domain(url):
        parsed_domain = urlparse(url)
        domain = parsed_domain.netloc or parsed_domain.path # Just in case, for urls without scheme
        domain_parts = domain.split('.')
        if len(domain_parts) > 2:
            return '.'.join(domain_parts[-(2 if domain_parts[-1] in {
                'com', 'net', 'org', 'io', 'ly', 'me', 'sh', 'fm', 'us'} else 3):])
        return domain
    
    extract_domain('google.com')          # google.com
    extract_domain('www.google.com')      # google.com
    extract_domain('sub.sub2.google.com') # google.com
    extract_domain('google.co.uk')        # google.co.uk
    extract_domain('sub.google.co.uk')    # google.co.uk
    extract_domain('sub.sub2.voila.fr')   # sub2.voila.fr
    
  • 2021-01-17 09:02

    This worked for my purposes. I figured I'd share it.

    ".".join("www.sun.google.com".split(".")[-2:])
    
  • 2021-01-17 09:04

    General structure of URL:

    scheme://netloc/path;parameters?query#fragment
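
    Python's urlparse maps onto this layout directly: it returns a named tuple with these six fields. The URL below is a made-up example chosen so that every field is non-empty.

```python
from urllib.parse import urlparse

parts = urlparse("http://www.example.com:8080/a/b;type=x?q=1&r=2#top")

print(parts.scheme)    # http
print(parts.netloc)    # www.example.com:8080
print(parts.path)      # /a/b
print(parts.params)    # type=x
print(parts.query)     # q=1&r=2
print(parts.fragment)  # top
```

    Note that netloc still carries the port; parts.hostname gives the host alone.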

    In the spirit of TIMTOWTDI (there's more than one way to do it):

    Using urlparse,

    >>> from urllib.parse import urlparse  # python 3.x
    >>> parsed_uri = urlparse('http://www.stackoverflow.com/questions/41899120/whatever')  # returns six components
    >>> domain = '{uri.netloc}/'.format(uri=parsed_uri)
    >>> result = domain.replace('www.', '')  # as per your case
    >>> print(result)
    stackoverflow.com/
    

    Using tldextract,

    >>> import tldextract  # The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers
    >>> tldextract.extract('http://forums.news.cnn.com/')
    ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
    

    In your case:

    >>> extracted = tldextract.extract('http://www.techcrunch.com/')
    >>> '{}.{}'.format(extracted.domain, extracted.suffix)
    'techcrunch.com'
    

    tldextract, on the other hand, knows what all gTLDs (generic top-level domains) and ccTLDs (country-code top-level domains) look like by looking up the currently active ones in the Public Suffix List. So, given a URL, it can tell the subdomain from the domain, and the domain from its country code.

    Cheerio! :)
