Extract main domain name from a given url

大城市里の小女人 提交于 2019-11-30 07:27:30

问题


I used the following to extract the domain from a url: (They are test cases)

String regex = "^(ww[a-zA-Z0-9-]{0,}\\.)";
ArrayList<String> cases = new ArrayList<String>();
cases.add("www.google.com");
cases.add("ww.socialrating.it");
cases.add("www-01.hopperspot.com");
cases.add("wwwsupernatural-brasil.blogspot.com");
cases.add("xtop10.net");
cases.add("zoyanailpolish.blogspot.com");

for (String t : cases) {  
    String res = t.replaceAll(regex, "");  
}

I can get the following results:

google.com
hopperspot.com
socialrating.it
blogspot.com
xtop10.net
zoyanailpolish.blogspot.com

The first four cases are good. The last one is not good. What I want is: blogspot.com for the last one, but it gives zoyanailpolish.blogspot.com. What am I doing wrong?


回答1:


As suggested by BalusC and others the most practical solution would be to get a list of TLDs (see this list), save them to a file, load them and then determine what TLD is being used by a given url String. From there on you could constitute the main domain name as follows:

    String url = "zoyanailpolish.blogspot.com";

    String tld = findTLD( url ); // To be implemented. Add to helper class ?

    url = url.replace( "." + tld,"");  

    int pos = url.lastIndexOf('.');

    String mainDomain = "";

    if (pos > 0 && pos < url.length() - 1) {
        mainDomain = url.substring(pos + 1) + "." + tld;
    }
    // else: Main domain name comes out empty

The implementation details are left up to you.




回答2:


Using Guava library, we can easily get domain name:

InternetDomainName.from(tld).topPrivateDomain()

Refer API link for more details

https://google.github.io/guava/releases/14.0/api/docs/

http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/net/InternetDomainName.html




回答3:


Obtain the host through REGEX is pretty complicated or impossible because TLD's don't obey to simple rules but are provided by ICANN and change in time.

You should use instead the functionality provided by JAVA library like this:

URL myUrl = new URL(urlString);
myUrl.getHost();



回答4:


This is 2013 and solution I found is straight forward:

System.out.println(InternetDomainName.fromLenient(uriHost).topPrivateDomain().name());



回答5:


It is much simpler:

  try {
        String domainName = new URL("http://www.zoyanailpolish.blogspot.com/some/long/link").getHost();

        String[] levels = domainName.split("\\.");
        if (levels.length > 1)
        {
            domainName = levels[levels.length - 2] + "." + levels[levels.length - 1];
        }

        // now value of domainName variable is blogspot.com
    } catch (Exception e) {}



回答6:


The reason why your are seeing zoyanailpolish.blogspot.com is that your regex finds only strings that start with a 'ww'. What you are asking is that in addition to removing all strings that start with a 'ww' , it should also work for a string starting with 'zoyanailpolish' (?). In that case , use the regex String regex = "^((ww|z|a)[a-zA-Z0-9-]{0,}\\.)"; This will remove any word that starts with a 'ww' or 'z' or 'a'. Customize it based on what you need exactly.




回答7:


InternetDomainName.from("test.blogspot.com").topPrivateDomain() -> test.blogspot.com

This works better in my case:

InternetDomainName.from("test.blogspot.com").topDomainUnderRegistrySuffix() -> blogspot.com

Details: https://github.com/google/guava/wiki/InternetDomainNameExplained



来源:https://stackoverflow.com/questions/7217271/extract-main-domain-name-from-a-given-url

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!