Implementing Public Suffix extraction using Java

醉梦人生 2020-12-17 03:04

I need to extract the top domain of a URL, and I found this: http://publicsuffix.org/index.html

and the Java implementation is in http://guava-librari

3 Answers
  • 2020-12-17 03:16

    I recently implemented a Public Suffix List API:

    PublicSuffixList suffixList = new PublicSuffixListFactory().build();
    
    assertEquals(
        "google.com", suffixList.getRegistrableDomain("example.google.com"));
    
    assertEquals(
        "bing.com", suffixList.getRegistrableDomain("bing.bing.bing.com"));
    
    assertEquals(
        "amazon.co.jp", suffixList.getRegistrableDomain("www.amazon.co.jp"));
    
  • 2020-12-17 03:19

    It looks to me like InternetDomainName.topPrivateDomain() does exactly what you want. Guava maintains a list of public suffixes (based on Mozilla's list at publicsuffix.org) that it uses to determine which part of the host is the public suffix. The top private domain is the public suffix plus its first child.

    Here's a quick example:

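    import com.google.common.collect.ImmutableList;
    import com.google.common.net.InternetDomainName;
    import java.net.URI;
    import java.net.URISyntaxException;

    public class Test {
      public static void main(String[] args) throws URISyntaxException {
        ImmutableList<String> urls = ImmutableList.of(
            "http://example.google.com", "http://google.com",
            "http://bing.bing.bing.com", "http://www.amazon.co.jp/");
        for (String url : urls) {
          System.out.println(url + " -> " + getTopPrivateDomain(url));
        }
      }

      private static String getTopPrivateDomain(String url) throws URISyntaxException {
        // Parse the URL and keep only the host part.
        String host = new URI(url).getHost();
        InternetDomainName domainName = InternetDomainName.from(host);
        // toString() returns the domain name; older Guava versions exposed it as name().
        return domainName.topPrivateDomain().toString();
      }
    }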
    

    Running this code prints:

    http://example.google.com -> google.com
    http://google.com -> google.com
    http://bing.bing.bing.com -> bing.com
    http://www.amazon.co.jp/ -> amazon.co.jp
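
    To make the "public suffix plus its first child" rule concrete, here is a minimal sketch that uses only the standard library. It is not how Guava implements it: the class name TopPrivateDomainSketch, the method topPrivateDomain, and the tiny hard-coded SUFFIXES set are assumptions for illustration. The real list at publicsuffix.org is much larger and also contains wildcard and exception rules, which this sketch ignores.

    ```java
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class TopPrivateDomainSketch {
      // Hypothetical, tiny stand-in for the real Public Suffix List.
      static final Set<String> SUFFIXES =
          new HashSet<>(Arrays.asList("com", "jp", "co.jp"));

      /** Returns the longest known suffix plus one more label, or null if there is none. */
      static String topPrivateDomain(String host) {
        String[] labels = host.toLowerCase().split("\\.");
        // Try candidates from longest to shortest, so "co.jp" is found before "jp".
        for (int i = 0; i < labels.length; i++) {
          String candidate =
              String.join(".", Arrays.copyOfRange(labels, i, labels.length));
          if (SUFFIXES.contains(candidate)) {
            if (i == 0) {
              return null; // the host is itself a public suffix
            }
            return labels[i - 1] + "." + candidate;
          }
        }
        return null; // no known suffix matched
      }

      public static void main(String[] args) {
        String[] hosts = {
          "example.google.com", "bing.bing.bing.com", "www.amazon.co.jp", "co.jp"
        };
        for (String host : hosts) {
          System.out.println(host + " -> " + topPrivateDomain(host));
        }
      }
    }
    ```

    With the three sample suffixes, the first three hosts map to google.com, bing.com and amazon.co.jp, and co.jp maps to null because nothing sits in front of the suffix.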
  • 2020-12-17 03:34

    EDIT: Sorry, I was a little too fast. I didn't think of co.jp, co.uk, and so on. You will need to get a list of possible TLDs from somewhere. You could also take a look at http://commons.apache.org/validator/ to validate a TLD.

    I think something like this should work, though there may already be a standard Java function for it.

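    String url = "http://www.foobar.com/someFolder/index.html";
    // Strip the scheme, if present.
    if (url.contains("://")) {
      url = url.split("://")[1];
    }

    // Strip the path, keeping only the host.
    if (url.contains("/")) {
      url = url.split("/")[0];
    }

    // You need to get your TLDs from somewhere...
    List<String> magicListofTLD = getTLDsFromSomewhere();

    // Pick the longest TLD that ends the host, so "co.jp" wins over "jp".
    // Matching with endsWith avoids hits in the middle of the host name.
    String usedTLD = null;
    for (String tld : magicListofTLD) {
      if (url.endsWith("." + tld)
          && (usedTLD == null || tld.length() > usedTLD.length())) {
        usedTLD = tld;
      }
    }

    if (usedTLD == null) {
      return; // no known TLD matched (assumes this runs inside a void method)
    }
    // Drop ".<TLD>" from the end and keep the last remaining label.
    url = url.substring(0, url.length() - usedTLD.length() - 1);
    String[] strings = url.split("\\.");

    String foo = strings[strings.length - 1] + "." + usedTLD;
    System.out.println(foo);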
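
    One possible way to fill in getTLDsFromSomewhere() is to parse text in the format published at publicsuffix.org: one rule per line, with blank lines and lines starting with // ignored. The following is only a sketch under assumptions. The class name SuffixListLoader, the method parse, and the inline sample text are made up for illustration, and wildcard (*.) and exception (!) rules are skipped rather than handled.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    public class SuffixListLoader {
      /** Collects plain suffix rules; skips comments, blanks, wildcard and exception rules. */
      static List<String> parse(String text) {
        List<String> suffixes = new ArrayList<>();
        for (String rawLine : text.split("\\R")) {
          String line = rawLine.trim();
          if (line.isEmpty() || line.startsWith("//")) {
            continue; // blank line or comment
          }
          // A rule ends at the first whitespace on the line.
          String rule = line.split("\\s+")[0];
          if (rule.startsWith("*.") || rule.startsWith("!")) {
            continue; // wildcard and exception rules are not handled in this sketch
          }
          suffixes.add(rule.toLowerCase());
        }
        return suffixes;
      }

      public static void main(String[] args) {
        // Hypothetical sample input; a real program would read the downloaded list file.
        String sample = "// sample rules\ncom\n\njp\nco.jp\n*.ck\n!www.ck\n";
        System.out.println(parse(sample)); // prints [com, jp, co.jp]
      }
    }
    ```

    Returning a List<String> keeps it compatible with the loop above, which picks the longest matching entry regardless of order.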
    