How to get the domain name (name + TLD) from a URL in Python

Posted by 我怕爱的太早我们不能终老 on 2019-12-04 01:42:27

Question


I want to extract the domain name (name of the site + TLD) from a list of URLs which may vary in format. For instance (current state ---> what I want):

mail.yahoo.com ---> yahoo.com
account.hotmail.co.uk ---> hotmail.co.uk
x.it ---> x.it
google.mail.com ---> google.com

Is there any Python code that can extract this from a URL, or do I have to do it manually?


Answer 1:


This is somewhat non-trivial, as there is no simple rule for deciding where the registrable domain (site name + public suffix, such as co.uk) begins. Instead, the set of public suffixes is maintained as a list at PublicSuffix.org.

A Python package exists that queries that list (stored locally); it's called publicsuffix:

>>> from publicsuffix import PublicSuffixList
>>> psl = PublicSuffixList()
>>> print(psl.get_public_suffix('mail.yahoo.com'))
yahoo.com
>>> print(psl.get_public_suffix('account.hotmail.co.uk'))
hotmail.co.uk
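The lookups above are given bare hostnames. If the input list holds full URLs, the hostname has to be isolated first. Below is a minimal sketch using only the standard library; the helper name `hostname_of` is hypothetical, and it assumes that any input without a scheme starts with the host.

```python
from urllib.parse import urlsplit


def hostname_of(url):
    """Return the lowercased hostname of a URL, with or without a scheme."""
    # urlsplit only recognises the host when it follows "//", so prepend
    # it for scheme-less inputs such as "mail.yahoo.com/inbox".
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    # .hostname drops any port and userinfo; it is None when no host is found.
    return urlsplit(url).hostname or ""


print(hostname_of("http://www.google.co.uk/some-page/some-sub-page/"))
print(hostname_of("account.hotmail.co.uk"))
```

The result can then be passed to whichever suffix lookup is used.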



Answer 2:


There is a maintained public list of TLDs and ccTLDs.

This Python project reads that list and matches your URL against it.

https://github.com/john-kurkowski/tldextract



Answer 3:


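Using the Python tld package. In current releases, get_fld returns the first-level domain (name + suffix), while get_tld returns only the suffix:

https://pypi.python.org/pypi/tld

$ pip install tld

from tld import get_fld
print(get_fld("http://www.google.co.uk/some-page/some-sub-page/"))
# google.co.uk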



Answer 4:


At this time I see six packages doing domain name splitting:

  • https://pypi.python.org/pypi/tldextract
  • https://pypi.python.org/pypi/tld
  • https://pypi.python.org/pypi/publicsuffixlist
  • https://pypi.python.org/pypi/publicsuffix
  • https://pypi.python.org/pypi/publicsuffix2
  • https://pypi.python.org/pypi/dnspy

They differ in how they cache Public Suffix List data (only tldextract uses a JSON file, which avoids re-parsing the list on load), in the strategy used to download that data, and in the structure they keep in memory, which determines the search algorithm. In the order listed above, those structures are: frozenset, set, set, dictionaries of labels, dictionaries of labels, and a dictionary of names.
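To make the set-based variant concrete, here is a rough sketch of a lookup against a frozenset of suffixes. It is not the implementation of any package listed above. The `SUFFIXES` subset is hand-picked for illustration, the function name `registrable_domain` is hypothetical, and the wildcard and exception rules of the real Public Suffix List are ignored.

```python
# Hand-picked subset of public suffixes, for illustration only.
# The real list has thousands of entries plus wildcard and exception rules.
SUFFIXES = frozenset({"com", "uk", "co.uk", "it"})


def registrable_domain(hostname):
    """Return the longest matching suffix plus one more label, or ''."""
    labels = hostname.lower().strip(".").split(".")
    # Try candidate suffixes from longest to shortest, so that "co.uk"
    # is preferred over "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in SUFFIXES:
            # i == 0 means the hostname is itself a public suffix,
            # so there is no site name to return.
            return ".".join(labels[i - 1:]) if i > 0 else ""
    return ""


for host in ("mail.yahoo.com", "account.hotmail.co.uk", "x.it"):
    print(host, "--->", registrable_domain(host))
```

Each candidate is a single hash lookup, so the cost grows with the number of labels in the hostname and not with the size of the list.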



Source: https://stackoverflow.com/questions/15460777/how-to-get-the-domainname-nametld-from-a-url-in-python
