Extract main domain name from a given url

若如初见. Submitted on 2019-11-29 03:57:09

As suggested by BalusC and others, the most practical solution is to get a list of TLDs (see this list), save them to a file, load them, and then determine which TLD a given URL string uses. From there you can construct the main domain name as follows:

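    String url = "zoyanailpolish.blogspot.com";

    String tld = findTLD(url); // To be implemented, e.g. in a helper class

    // Strip only the trailing "." + tld; replace() would also remove earlier matches
    String rest = url.endsWith("." + tld)
            ? url.substring(0, url.length() - tld.length() - 1)
            : url;

    int pos = rest.lastIndexOf('.');

    // Keep the label directly in front of the TLD, e.g. "blogspot.com".
    // If no dot is left (pos == -1), the whole remainder is that label.
    String mainDomain = rest.substring(pos + 1) + "." + tld;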

The implementation details are left up to you.
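One way the findTLD placeholder could be filled in, as a minimal sketch: assume the suffix list has already been loaded into a Set<String> (lowercase entries, no leading dot) and return the longest entry that the host ends with on a label boundary. The class name TldFinder, the two-argument signature, and the tiny hard-coded sample set are assumptions for illustration only; a real list would be read from the file described above.

```java
import java.util.Locale;
import java.util.Set;

final class TldFinder {

    /**
     * Returns the longest entry of knownSuffixes that host ends with
     * on a label boundary, or null if none matches.
     */
    static String findTLD(String host, Set<String> knownSuffixes) {
        String candidate = host.toLowerCase(Locale.ROOT);
        // Drop one leading label at a time, so longer suffixes are tested first.
        int dot = candidate.indexOf('.');
        while (dot >= 0) {
            candidate = candidate.substring(dot + 1);
            if (knownSuffixes.contains(candidate)) {
                return candidate;
            }
            dot = candidate.indexOf('.');
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical sample; a real list would be loaded from a file.
        Set<String> suffixes = Set.of("com", "uk", "co.uk");
        System.out.println(findTLD("zoyanailpolish.blogspot.com", suffixes));
        System.out.println(findTLD("www.example.co.uk", suffixes));
    }
}
```

Testing the longest candidate first matters for multi-label suffixes: with both "uk" and "co.uk" in the set, "www.example.co.uk" resolves to "co.uk" rather than "uk".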

Satya

Using the Guava library, we can easily get the domain name from a host name string:

    InternetDomainName.from(host).topPrivateDomain()

Refer to the API documentation for more details:

https://google.github.io/guava/releases/14.0/api/docs/

http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/net/InternetDomainName.html

Obtaining the host with a regex is complicated, if not impossible, because TLDs don't follow simple rules: they are provided by ICANN and change over time.

Instead, use the functionality provided by the Java standard library, like this:

    URL myUrl = new URL(urlString); // throws MalformedURLException
    String host = myUrl.getHost();
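A slightly fuller sketch of the same idea, with error handling. It uses java.net.URI rather than the URL constructor, and its getHost() returns null when the input has no authority part. The class name HostExtractor, the method name hostOf, and the choice to prepend http:// to input without a scheme are assumptions made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class HostExtractor {

    /** Returns the host part of urlString, or null if it cannot be determined. */
    static String hostOf(String urlString) {
        // Assumption: input without a scheme, such as "example.com/path", is treated as http.
        String candidate = urlString.contains("://") ? urlString : "http://" + urlString;
        try {
            return new URI(candidate).getHost();
        } catch (URISyntaxException e) {
            // Illegal characters or otherwise unparsable input
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("http://www.zoyanailpolish.blogspot.com/some/long/link"));
        System.out.println(hostOf("zoyanailpolish.blogspot.com"));
    }
}
```

Note that this only yields the full host name; reducing it to the main domain still needs a TLD list or a library such as Guava.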

It is 2013, and the solution I found is straightforward:

    System.out.println(InternetDomainName.fromLenient(uriHost).topPrivateDomain().name());

It is much simpler:

    try {
        String domainName = new URL("http://www.zoyanailpolish.blogspot.com/some/long/link").getHost();

        // Keep only the last two labels of the host name
        String[] levels = domainName.split("\\.");
        if (levels.length > 1) {
            domainName = levels[levels.length - 2] + "." + levels[levels.length - 1];
        }

        // domainName is now "blogspot.com"
    } catch (MalformedURLException e) {
        // The input was not a valid URL
    }
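A caveat worth checking before relying on this approach: keeping the last two labels only gives the expected result when the suffix is a single label such as com. The sketch below wraps the same splitting logic in a hypothetical helper (the class and method names are assumptions for illustration) and runs it on a host with a two-label suffix.

```java
final class LastTwoLabels {

    /** Returns the last two dot-separated labels of host, or host itself if it has fewer. */
    static String lastTwoLabels(String host) {
        String[] levels = host.split("\\.");
        if (levels.length > 1) {
            return levels[levels.length - 2] + "." + levels[levels.length - 1];
        }
        return host;
    }

    public static void main(String[] args) {
        System.out.println(lastTwoLabels("www.zoyanailpolish.blogspot.com")); // blogspot.com
        System.out.println(lastTwoLabels("www.example.co.uk"));               // co.uk
    }
}
```

The second call returns co.uk, which is a suffix rather than a registrable domain, so this shortcut is only safe when the inputs are known to use single-label TLDs.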

The reason you are seeing zoyanailpolish.blogspot.com is that your regex only matches strings that start with 'ww'. What you are asking is that, in addition to stripping a prefix that starts with 'ww', it should also work for a string starting with 'zoyanailpolish' (?). In that case, use the regex String regex = "^((ww|z|a)[a-zA-Z0-9-]{0,}\\.)"; This removes the leading label, up to and including the first dot, when it starts with 'ww', 'z', or 'a'. Customize it based on what exactly you need.
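To see what that pattern does in practice, here is a small sketch that applies it with String.replaceFirst. The class name PrefixRegexDemo and the helper stripPrefix are assumptions for illustration; the regex itself is the one quoted above.

```java
final class PrefixRegexDemo {

    // Strips a leading label that starts with "ww", "z" or "a", including its trailing dot.
    static final String REGEX = "^((ww|z|a)[a-zA-Z0-9-]{0,}\\.)";

    static String stripPrefix(String host) {
        return host.replaceFirst(REGEX, "");
    }

    public static void main(String[] args) {
        System.out.println(stripPrefix("www.example.com"));             // example.com
        System.out.println(stripPrefix("zoyanailpolish.blogspot.com")); // blogspot.com
        System.out.println(stripPrefix("apple.com"));                   // com
        System.out.println(stripPrefix("blogspot.com"));                // blogspot.com
    }
}
```

The apple.com case shows why the pattern needs customizing: any first label that happens to start with one of the listed prefixes is removed, even when it is the main domain itself.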

InternetDomainName.from("test.blogspot.com").topPrivateDomain() -> test.blogspot.com

This works better in my case:

InternetDomainName.from("test.blogspot.com").topDomainUnderRegistrySuffix() -> blogspot.com

Details: https://github.com/google/guava/wiki/InternetDomainNameExplained
