How to get the domain name (name + TLD) from a URL in Python

温柔的废话 2021-01-14 04:51

I want to extract the domain name (site name + TLD) from a list of URLs that may vary in format. For instance (current state ----> what I want):

mai         


        
4 answers
  • 2021-01-14 05:13

    This is somewhat non-trivial, as there is no simple rule for deciding what counts as a public suffix (com, co.uk and so on), and therefore where the registrable domain (site name + public suffix) begins. Instead, the set of public suffixes is maintained as a list at PublicSuffix.org.

    A Python package exists that queries a locally stored copy of that list; it is called publicsuffix:

    >>> from publicsuffix import PublicSuffixList
    >>> psl = PublicSuffixList()
    >>> print(psl.get_public_suffix('mail.yahoo.com'))
    yahoo.com
    >>> print(psl.get_public_suffix('account.hotmail.co.uk'))
    hotmail.co.uk
    
  • 2021-01-14 05:15

    At this time I see six packages doing domain name splitting:

    • https://pypi.python.org/pypi/tldextract
    • https://pypi.python.org/pypi/tld
    • https://pypi.python.org/pypi/publicsuffixlist
    • https://pypi.python.org/pypi/publicsuffix
    • https://pypi.python.org/pypi/publicsuffix2
    • https://pypi.python.org/pypi/dnspy

    They differ in how they cache the Public Suffix List data (only tldextract uses a JSON file, which avoids re-parsing the list on load), in the strategy used to download that data, and in the structure they keep in memory (respectively: frozenset, set, set, dictionaries of labels, the same again, and a dictionary of names), which determines the search algorithm.
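
    To illustrate the set-based lookup, here is a minimal sketch that is not taken from any of the packages above. `SAMPLE_SUFFIXES` is a tiny hand-picked stand-in for the real list, and `registrable_domain` is a hypothetical helper name; the real Public Suffix List also has wildcard and exception rules, which this sketch ignores.

```python
# Tiny hand-picked sample, NOT the real Public Suffix List (illustration only).
SAMPLE_SUFFIXES = frozenset({"com", "org", "uk", "co.uk"})


def registrable_domain(hostname, suffixes=SAMPLE_SUFFIXES):
    """Return the longest matching suffix plus one label to its left, or None."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try the longest candidate first, so "co.uk" wins over "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # the hostname is itself a public suffix
            return ".".join(labels[i - 1:])
    return None  # no known suffix matched


print(registrable_domain("mail.yahoo.com"))         # yahoo.com
print(registrable_domain("account.hotmail.co.uk"))  # hotmail.co.uk
```

    Each membership test on a frozenset is a hash lookup, so the cost grows with the number of labels in the hostname rather than with the size of the list.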

  • 2021-01-14 05:21

    Using the Python tld package:

    https://pypi.python.org/pypi/tld

    $ pip install tld

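    # In current releases of tld, get_tld returns only the suffix ('co.uk');
    # get_fld returns the first-level domain (site name + suffix).
    from tld import get_fld
    print(get_fld("http://www.google.co.uk/some-page/some-sub-page/"))
    # google.co.uk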
    
  • 2021-01-14 05:35

    There is a publicly maintained list of TLDs and ccTLDs.

    This Python project reads that list and matches your URL against it:

    https://github.com/john-kurkowski/tldextract
    