In Java, how do I extract the domain of a URL?

Question

I'm using Java 8. I want to extract the domain portion of a URL. Just in case I'm using the word "domain" incorrectly, what I want is this: if my server name is

test.javabits.com

I want to extract "javabits.com". Similarly, if my server name is

firstpart.secondpart.lastpart.org

I want to extract "lastpart.org". I tried the following:

final String domain = request.getServerName().replaceAll(".*\\.(?=.*\\.)", "");

but it's not extracting the domain properly. Then I tried what this guy has on his site, https://www.mkyong.com/regular-expressions/domain-name-regular-expression-example/, e.g.

private static final String DOMAIN_NAME_PATTERN = "^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\\.)+[A-Za-z]{2,6}$";

but that is also not extracting what I want. How can I extract the domain name portion properly?

Answer 1:

Summary: Do not use regex for this. Use whois.

If I try to extrapolate from your question, to find out what you really want to do, I guess you want to find the domain belonging to some non-infrastructural owner from the host part of a URL. Additionally, from the tag of your question, you want to do it with the help of a regex.

The task you are undertaking is at best impractical, but probably impossible.

There are a number of corner cases that you would have to weed out. Apart from the list of infrastructural domains kindly provided by Lennart in https://publicsuffix.org/list/public_suffix_list.dat, you also have the cases of an empty host field in the URL or an IP address forming the host part.
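To make two of those corner cases concrete, here is a minimal sketch that pulls the host out of a URL with java.net.URI and flags a missing host or an IP literal. The class and method names (HostInspector, hostOf, looksLikeIpAddress) are made up for illustration, and the IP check is deliberately crude; it does not consult the public suffix list at all.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; not part of any library.
class HostInspector {

    /** Returns the host part of the URL, or null if it has none or cannot be parsed. */
    static String hostOf(String url) {
        try {
            // getHost() is null for URLs without a host, e.g. file:///tmp/x
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Rough check (an assumption, not a validator): dotted-quad IPv4 or bracketed IPv6. */
    static boolean looksLikeIpAddress(String host) {
        return host.matches("\\d{1,3}(\\.\\d{1,3}){3}") || host.startsWith("[");
    }
}
```

Only a host that is non-null and does not look like an IP address is worth passing on to any domain lookup.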

So, is there a better approach to this? Of course there is. What you do want to do is query a public database for the data you need. The protocol for such queries is called WHOIS.

Apache Commons Net provides an easy way to access WHOIS information through the WhoisClient class. From there you can query the domain field, and find some more information that may be useful to you.

It shouldn't be harder than

import org.apache.commons.net.whois.WhoisClient;
import java.io.IOException;

public class CommonsTest {
    public static void main(String[] args) {
        // Domain to look up, passed as the first command-line argument.
        String domain = args[0];
        WhoisClient c = new WhoisClient();
        try {
            // Connect to the default whois server shipped with Commons Net.
            c.connect(WhoisClient.DEFAULT_HOST);
            System.out.println(c.query(domain));
            c.disconnect();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Using this will get you the whois information about the domain you are asking for. If the name is not itself registered, that is, it is a private subdomain, as in the case of www.stackexchange.com, you will get an error saying no domain is registered. Remove the first part of the address and try again. Once you have found the registered domain, you will also find the registrar and the registrant.
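The "remove the first part and try again" step can be sketched as a list of candidate names to query in order. This is only an illustration: WhoisCandidates and candidates are hypothetical names, and how you recognise a "no match" reply depends on the whois server, so that part is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Commons Net.
class WhoisCandidates {

    /** Lists the host itself, then each parent obtained by dropping the leftmost label. */
    static List<String> candidates(String host) {
        List<String> result = new ArrayList<>();
        String current = host;
        // Stop before reaching a bare top-level label such as "com".
        while (current.contains(".")) {
            result.add(current);
            current = current.substring(current.indexOf('.') + 1);
        }
        return result;
    }
}
```

For www.stackexchange.com this yields www.stackexchange.com followed by stackexchange.com; each candidate would be passed to c.query(...) in turn until the server returns a registration record.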

Now, unfortunately, whois is not as simple as one would think. Read further on https://manpages.debian.org/jessie/whois/whois.1.en.html for an elaboration on how to use it and what information you can expect from different sources.

Also, check related questions here.

Answer 2:

Try it like this:

// split() takes a regex, so the dot must be escaped
String[] parts = longDomain.split("\\.");
String domain = parts[parts.length - 2] + "." + parts[parts.length - 1];
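Wrapped in a method with a guard for hosts that have fewer than two labels, the same idea might look like the sketch below. The class and method names are made up for illustration, and it assumes the registered domain is always the last two labels, which is not true for every suffix.

```java
// Hypothetical helper for illustration only.
class LastTwoLabels {

    /** Returns the last two dot-separated labels of host, or host itself if it has fewer. */
    static String lastTwoLabels(String host) {
        String[] parts = host.split("\\."); // escape the dot: split() takes a regex
        if (parts.length < 2) {
            return host; // e.g. "localhost"
        }
        return parts[parts.length - 2] + "." + parts[parts.length - 1];
    }
}
```

This gives javabits.com for test.javabits.com and lastpart.org for firstpart.secondpart.lastpart.org, but for www.example.co.uk it returns co.uk, which is the kind of case the public suffix list exists for.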

Source: https://stackoverflow.com/questions/51634183/in-java-how-do-i-extract-the-domain-of-a-url

Tags

java

regex

string

parsing

subdomain