Get subdomain from URL using Python

孤城傲影 2020-12-20 13:00

For example, the address is:

Address = http://lol1.domain.com:8888/some/page

8 answers
  • 2020-12-20 13:15

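    What you are looking for is the urlparse module: http://docs.python.org/library/urlparse.html (in Python 3 the same functions live in urllib.parse).

    For example: ".".join(urlparse('http://www.my.cwi.nl:80/%7Eguido/Python.html').netloc.split(".")[:-2])

    will do the job for you (it returns "www.my"). Note that it simply drops the last two labels of the host, so it assumes a one-label suffix such as .nl or .com.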
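A minimal sketch of the same idea for Python 3, assuming the registered domain is always the last two labels. The helper name `naive_subdomain` is made up for illustration; it reads `hostname` rather than `netloc` so that the port never ends up in a label.

```python
from urllib.parse import urlparse


def naive_subdomain(url):
    """Return everything left of the last two host labels (hypothetical helper)."""
    # .hostname drops the port and lowercases the host; it is None when the
    # URL has no network location (for example, when the scheme is missing).
    host = urlparse(url).hostname
    if host is None:
        return ""
    # Assumption: the registered domain is exactly the last two labels.
    return ".".join(host.split(".")[:-2])


print(naive_subdomain("http://www.my.cwi.nl:80/%7Eguido/Python.html"))  # www.my
print(naive_subdomain("http://lol1.domain.com:8888/some/page"))         # lol1
print(naive_subdomain("http://forums.bbc.co.uk"))                       # forums.bbc
```

The last call shows the limit of the assumption: for a two-label suffix such as `co.uk`, the domain name `bbc` is reported as part of the subdomain.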

  • 2020-12-20 13:20

    This is a modified version of the fantastic answer to "How to extract top-level domain name (TLD) from URL".

    You will need the list of effective TLDs (the Public Suffix List), saved locally as effective_tld_names.dat.txt.

    from urllib.parse import urlparse

    # Load the suffix rules, ignoring comments ("// ...") and empty lines.
    with open("effective_tld_names.dat.txt", encoding="utf-8") as tld_file:
        tlds = {line.strip() for line in tld_file if line[0] not in "/\n"}

    class DomainParts:
        def __init__(self, domain_parts, tld):
            self.domain = None
            self.subdomains = None
            self.tld = tld
            if domain_parts:
                self.domain = domain_parts[-1]
                if len(domain_parts) > 1:
                    self.subdomains = domain_parts[:-1]

    def get_domain_parts(url, tlds):
        url_elements = urlparse(url).hostname.split('.')
        # url_elements = ["abcde", "co", "uk"]
        for i in range(-len(url_elements), 0):
            last_i_elements = url_elements[i:]
            #    i=-3: ["abcde", "co", "uk"]
            #    i=-2: ["co", "uk"]
            #    i=-1: ["uk"] etc.

            candidate = ".".join(last_i_elements)  # abcde.co.uk, co.uk, uk
            wildcard_candidate = ".".join(["*"] + last_i_elements[1:])  # *.co.uk, *.uk, *
            exception_candidate = "!" + candidate

            # Match against the suffix rules.
            if exception_candidate in tlds:
                # An exception rule (e.g. "!www.ck") marks a registrable domain,
                # so the suffix is everything after the candidate's first label.
                return DomainParts(url_elements[:i] + last_i_elements[:1],
                                   ".".join(last_i_elements[1:]))
            if candidate in tlds or wildcard_candidate in tlds:
                # For ["abcde", "co", "uk"] the domain parts are ["abcde"].
                return DomainParts(url_elements[:i], ".".join(url_elements[i:]))

        raise ValueError("Domain not in global list of TLDs")

    domain_parts = get_domain_parts("http://sub2.sub1.example.co.uk:80", tlds)
    print("Domain:", domain_parts.domain)
    print("Subdomains:", domain_parts.subdomains or "None")
    print("TLD:", domain_parts.tld)
    

    Gives you:

    Domain: example
    Subdomains: ['sub2', 'sub1']
    TLD: co.uk
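To see how plain, wildcard, and exception rules interact without downloading the full list, here is a small sketch that applies the same matching idea to a hand-picked rule set. `SAMPLE_RULES` and `split_host` are illustrative names, and the five rules are only a sample chosen for the demo, not the real list.

```python
from urllib.parse import urlparse

# A tiny, hand-picked sample of suffix rules (assumption: demo only).
SAMPLE_RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}


def split_host(url, rules=SAMPLE_RULES):
    """Return (subdomains, domain, suffix) for the host in url."""
    host = urlparse(url).hostname
    if not host:
        raise ValueError("no host in %r" % url)
    labels = host.split(".")
    # Try the longest candidate suffix first, then shorter ones.
    for start in range(len(labels)):
        tail = labels[start:]
        candidate = ".".join(tail)
        wildcard = ".".join(["*"] + tail[1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is a registrable domain,
            # so the suffix begins one label later.
            suffix_at = start + 1
        elif candidate in rules or wildcard in rules:
            suffix_at = start
        else:
            continue
        if suffix_at == 0:
            break  # the whole host is a suffix, so there is no domain label
        return (labels[:suffix_at - 1],
                labels[suffix_at - 1],
                ".".join(labels[suffix_at:]))
    raise ValueError("no usable suffix rule for %r" % url)


print(split_host("http://sub2.sub1.example.co.uk:80"))  # (['sub2', 'sub1'], 'example', 'co.uk')
print(split_host("http://foo.bar.ck"))                  # ([], 'foo', 'bar.ck')
print(split_host("http://www.ck"))                      # ([], 'www', 'ck')
```

The last two calls show why the exception check has to come first: `*.ck` would otherwise treat `www.ck` as a suffix, while `!www.ck` says it is an ordinary domain under `ck`.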
    
  • 2020-12-20 13:21

    The tldextract package makes this task very easy, and you can still use urlparse, as suggested, if you need any further information:

    >>> import tldextract
    >>> from urllib.parse import urlparse
    >>> tldextract.extract("http://lol1.domain.com:8888/some/page")
    ExtractResult(subdomain='lol1', domain='domain', suffix='com')
    >>> tldextract.extract("http://sub.lol1.domain.com:8888/some/page")
    ExtractResult(subdomain='sub.lol1', domain='domain', suffix='com')
    >>> urlparse("http://sub.lol1.domain.com:8888/some/page")
    ParseResult(scheme='http', netloc='sub.lol1.domain.com:8888', path='/some/page', params='', query='', fragment='')
    

    Note that tldextract handles multi-level subdomains correctly: sub.lol1 above comes back as a single subdomain value.

  • 2020-12-20 13:22

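    urlparse splits the URL into scheme, network location, path, and so on; its hostname attribute is the network location without the port. You can then split the hostname on "." and take the first label as the subdomain.

    from urllib.parse import urlparse

    address = 'http://lol1.domain.com:8888/some/page'
    url = urlparse(address)
    subdomain = url.hostname.split('.')[0]  # 'lol1'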
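Taking the first label only makes sense when the host actually has a subdomain. Below is a minimal sketch of a guarded version; `first_label` is a hypothetical helper, and its "at least three labels" rule is a simplifying assumption that breaks for suffixes such as `co.uk`.

```python
import ipaddress
from urllib.parse import urlparse


def first_label(address):
    """Return the leftmost host label, or None when there is no subdomain."""
    host = urlparse(address).hostname  # None when the scheme ("http://") is missing
    if host is None:
        return None
    try:
        ipaddress.ip_address(host)
    except ValueError:
        pass  # not an IP literal, carry on
    else:
        return None  # IP addresses have no subdomain
    labels = host.split(".")
    # Assumption: domain plus a one-label suffix, so a subdomain needs 3+ labels.
    return labels[0] if len(labels) >= 3 else None


print(first_label("http://lol1.domain.com:8888/some/page"))  # lol1
print(first_label("http://domain.com/"))                     # None
print(first_label("http://127.0.0.1:8888/"))                 # None
```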
    
  • 2020-12-20 13:24

    A very basic approach, without any sanity checking, could look like this:

    address = 'http://lol1.domain.com:8888/some/page'

    host = address.partition('://')[2]   # everything after the scheme
    sub_addr = host.partition('.')[0]    # everything before the first dot

    print(sub_addr)
    

    This of course assumes that when you say 'subdomain' you mean the first part of a host name, so in the following case, 'www' would be the subdomain:

    http://www.google.com/

    Is that what you mean?
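To illustrate what "without any sanity checking" costs, the sketch below runs the `partition` approach next to `urlparse` on a few inputs. Both function names are made up for the comparison.

```python
from urllib.parse import urlparse


def sub_by_partition(address):
    """The partition approach from this answer, wrapped in a function."""
    host = address.partition("://")[2]
    return host.partition(".")[0]


def sub_by_urlparse(address):
    """First host label via urlparse, which strips credentials, port and path."""
    host = urlparse(address).hostname or ""
    return host.partition(".")[0]


for address in (
    "http://lol1.domain.com:8888/some/page",
    "http://user:secret@lol1.domain.com/",
    "http://localhost:8888/some.page",
):
    print(address, "->", repr(sub_by_partition(address)),
          "vs", repr(sub_by_urlparse(address)))
```

The two functions agree on the first address. On the second, the partition version keeps the credentials (`user:secret@lol1`), and on the third it runs past the port into the path (`localhost:8888/some`), because it cuts at the first dot anywhere in the string rather than in the host.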

  • 2020-12-20 13:35

    We can use tldextract (https://github.com/john-kurkowski/tldextract) for this problem. It's easy:

    >>> import tldextract
    >>> ext = tldextract.extract('http://forums.bbc.co.uk')
    >>> (ext.subdomain, ext.domain, ext.suffix)
    ('forums', 'bbc', 'co.uk')
    