In Java, how do I extract the domain of a URL?

前提是你 submitted on 2019-12-11 07:59:49

Question


I'm using Java 8. I want to extract the domain portion of a URL. In case I'm using the word "domain" incorrectly, here is what I mean: if my server name is

test.javabits.com

I want to extract "javabits.com". Similarly, if my server name is

firstpart.secondpart.lastpart.org

I want to extract "lastpart.org". I tried the following:

final String domain = request.getServerName().replaceAll(".*\\.(?=.*\\.)", "");

but it's not extracting the domain properly. Then I tried the pattern given at https://www.mkyong.com/regular-expressions/domain-name-regular-expression-example/, e.g.

private static final String DOMAIN_NAME_PATTERN = "^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\\.)+[A-Za-z]{2,6}$";

but that is also not extracting what I want. How can I extract the domain name portion properly?


Answer 1:


Summary: Do not use regex for this. Use whois.

If I try to extrapolate from your question, to find out what you really want to do, I guess you want to find the domain belonging to some non-infrastructural owner from the host part of a URL. Additionally, from the tag of your question, you want to do it with the help of a regex.

The task you are undertaking is at best impractical, but probably impossible.

There are a number of corner cases that you would have to weed out. Apart from the list of infrastructural domains kindly provided by Lennart in https://publicsuffix.org/list/public_suffix_list.dat, you also have the cases of an empty host field in the URL or an IP-address forming the host part.
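To make those two corner cases concrete, here is a minimal sketch that sorts a URL's host part into "none", "ip", or "name" before any domain extraction is attempted. The class name `HostKind` and the method `classify` are made up for this illustration, and the IPv4 check only looks at the dotted-quad shape, so treat it as a rough filter rather than a full validator.

```java
import java.net.URI;
import java.util.regex.Pattern;

final class HostKind {
    // Dotted-quad shape only; does not check that each octet is <= 255.
    private static final Pattern IPV4 =
            Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}");

    /** Returns "none", "ip", or "name" for the host part of the given URL. */
    static String classify(String url) {
        // URI.create throws IllegalArgumentException on malformed input.
        String host = URI.create(url).getHost();
        if (host == null || host.isEmpty()) {
            return "none"; // e.g. file:///tmp/x has no host at all
        }
        // java.net.URI reports IPv6 literals wrapped in square brackets.
        if (host.startsWith("[") || IPV4.matcher(host).matches()) {
            return "ip";
        }
        return "name";
    }

    public static void main(String[] args) {
        System.out.println(classify("http://test.javabits.com/index.html"));
        System.out.println(classify("http://192.168.0.1/"));
        System.out.println(classify("file:///tmp/x"));
    }
}
```

Only hosts classified as "name" are worth passing on to a domain lookup.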

So, is there a better approach to this? Of course there is. What you do want to do is query a public database for the data you need. The protocol for such queries is called WHOIS.

Apache Commons Net provides an easy way to access WHOIS information through its WhoisClient class. From there you can query the domain field, and find some more information that may be useful to you.

It shouldn't be harder than

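import org.apache.commons.net.whois.WhoisClient;
import java.io.IOException;

public class CommonsTest {
    public static void main(String[] args) {
        // Placeholder: the host name taken from your URL.
        String domain = "stackexchange.com";
        WhoisClient c = new WhoisClient();
        try {
            // Connect to the library's default WHOIS server.
            c.connect(WhoisClient.DEFAULT_HOST);
            // query() returns the raw WHOIS response as text.
            System.out.println(c.query(domain));
            c.disconnect();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}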

Using this will get you the WHOIS information about the domain you are asking for. If the name is unregistered, that is, it is a private subdomain as in the case of www.stackexchange.com, you will get an error saying no domain is registered. Remove the first part of the address and try again. Once you have found the registered domain, you will also find the registrar and the registrant.
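The "remove the first part and try again" step can be sketched as a list of candidate names to query in order, from the full host down to the last two labels. The class `WhoisCandidates` and its method are hypothetical names for this illustration, and no network access happens here; each candidate would be passed to the WHOIS query until one is reported as registered.

```java
import java.util.ArrayList;
import java.util.List;

final class WhoisCandidates {
    /** Lists the host and each shorter name obtained by dropping the leading label. */
    static List<String> candidates(String host) {
        List<String> result = new ArrayList<>();
        String current = host;
        // Stop once only a single label (such as "com") would remain.
        while (current.contains(".")) {
            result.add(current);
            current = current.substring(current.indexOf('.') + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        // Each printed name is one WHOIS query to try, in order.
        for (String name : candidates("www.stackexchange.com")) {
            System.out.println(name);
        }
    }
}
```

For www.stackexchange.com this yields www.stackexchange.com followed by stackexchange.com.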

Now, unfortunately, whois is not as simple as one would think. Read further on https://manpages.debian.org/jessie/whois/whois.1.en.html for an elaboration on how to use it and what information you can expect from different sources.

Also, check related questions here.




Answer 2:


Try it like this:

// split() takes a regex, so the dot must be escaped.
String[] parts = longDomain.split("\\.");
String domain = parts[parts.length - 2] + "." + parts[parts.length - 1];
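The split-based idea can be wrapped in a small helper that also copes with hosts that have fewer than two labels. This is only a sketch: the class and method names are invented here, and it keeps the approach's known limitation, since for a host under a two-label public suffix such as co.uk it returns the suffix itself rather than the registered domain.

```java
final class LastTwoLabels {
    /** Returns the last two dot-separated labels of host, or host itself if it has fewer. */
    static String lastTwoLabels(String host) {
        // String.split takes a regular expression, so the dot is escaped.
        String[] parts = host.split("\\.");
        if (parts.length < 2) {
            return host; // e.g. "localhost"
        }
        return parts[parts.length - 2] + "." + parts[parts.length - 1];
    }

    public static void main(String[] args) {
        System.out.println(lastTwoLabels("test.javabits.com"));                 // javabits.com
        System.out.println(lastTwoLabels("firstpart.secondpart.lastpart.org")); // lastpart.org
    }
}
```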


Source: https://stackoverflow.com/questions/51634183/in-java-how-do-i-extract-the-domain-of-a-url
