Fully qualified domain name validation

Submitted by 十年热恋 on 2019-11-27 07:00:28
John Nagle

It's harder nowadays, with internationalized domain names and several thousand (!) new TLDs.

The easy part is that you can still split the components on ".".

You need a list of registerable TLDs. There's a site for that:

https://publicsuffix.org/list/effective_tld_names.dat

You only need to check the ICANN-recognized ones. Note that a registerable TLD can have more than one component, such as "co.uk".
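As a rough illustration of the lookup described above, here is a minimal Python sketch. It assumes the list has already been downloaded as text; the tiny sample below is hand-written in the file's format, the function names are made up for this example, and the list's wildcard (*) and exception (!) rules are ignored.

```python
# Hand-written sample in the format of effective_tld_names.dat (the real file is far longer).
SAMPLE_LIST = """\
// ===BEGIN ICANN DOMAINS===
com
uk
co.uk
// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===
blogspot.com
// ===END PRIVATE DOMAINS===
"""


def load_icann_suffixes(text):
    """Collect the plain rules found inside the ICANN section of the list."""
    suffixes = set()
    in_icann = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("// ===BEGIN ICANN DOMAINS==="):
            in_icann = True
        elif line.startswith("// ===END ICANN DOMAINS==="):
            in_icann = False
        elif in_icann and line and not line.startswith("//"):
            suffixes.add(line.lower())
    return suffixes


def longest_suffix(hostname, suffixes):
    """Return the longest listed suffix of hostname, or None if there is none."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try the longest candidate first so "co.uk" wins over "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return candidate
    return None


suffixes = load_icann_suffixes(SAMPLE_LIST)
print(longest_suffix("www.example.co.uk", suffixes))
print(longest_suffix("localhost", suffixes))
```

A production check would also need the wildcard and exception rules, so treat this only as a sketch of the multi-component idea.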

Then there's IDN and punycode. Domains are Unicode now. For example,

"xn--nnx388a" is equivalent to "臺灣". Both of those are valid TLDs, incidentally.

For punycode conversion code, see "http://golang.org/src/pkg/net/http/cookiejar/punycode.go".
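If Python is an option, its built-in "idna" codec can do the conversion. This is only a sketch: the helper names are made up here, and the built-in codec implements the older IDNA 2003 rules rather than the RFC 5890 family, so results can differ for some names.

```python
def to_ascii(domain):
    """Convert a Unicode domain (or TLD) to its ASCII "xn--" form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain):
    """Convert an ASCII "xn--" domain back to Unicode."""
    return domain.encode("ascii").decode("idna")


print(to_ascii("臺灣"))                      # xn--nnx388a
print(to_unicode("xn--nnx388a") == "臺灣")   # True
```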

Checking the syntax of each domain component has new rules, too. See RFC5890 at http://tools.ietf.org/html/rfc5890

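Components can be either ASCII labels or Unicode labels (U-labels). ASCII labels either follow the old letter-digit-hyphen syntax, or begin with "xn--", in which case they are A-labels: the punycode version of a Unicode string.

The rules for Unicode are very complex. RFC 5890 defines the terminology, and the companion RFCs 5891 to 5893 give the protocol, the permitted code points, and the handling of right-to-left scripts. The rules are designed to prevent such things as mixing characters from left-to-right and right-to-left sets.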
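A first-pass sort of labels into those kinds might look like the sketch below. The function name and the returned strings are invented for this example, and a label reported as "unicode" is only a candidate: none of the real Unicode rules are checked here.

```python
import re

# Old-style label: letters, digits, hyphens; no hyphen at either end; 1 to 63 chars.
LDH_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def classify_label(label):
    """Roughly classify one label as 'ldh', 'a-label', 'unicode' or 'invalid'."""
    if not label.isascii():
        return "unicode"  # candidate only; the complex Unicode rules are not applied
    if LDH_RE.fullmatch(label) is None:
        return "invalid"
    if label.lower().startswith("xn--"):
        try:
            decoded = label.lower()[4:].encode("ascii").decode("punycode")
        except UnicodeError:
            return "invalid"
        # A real A-label must decode to something that is not plain ASCII.
        return "a-label" if decoded and not decoded.isascii() else "invalid"
    return "ldh"


for label in ["example", "xn--nnx388a", "-bad"]:
    print(label, classify_label(label))
```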

Sorry there's no easy answer.

bkr
(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63}$)

A regex is always going to be at best an approximation for things like this, and the rules change over time. The above regex was written with the following in mind and is specific to hostnames:

Hostnames are composed of a series of labels concatenated with dots. Each label is 1 to 63 characters long, and may contain:

  • the ASCII letters a-z (in a case insensitive manner),
  • the digits 0-9,
  • and the hyphen ('-').

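Additionally (as enforced by the regex above):

  • labels cannot start or end with a hyphen,
  • the whole hostname is between 4 and 253 characters long.

Some assumptions: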

  • TLD is at least 2 characters and only a-z
  • we want at least 1 level above TLD

results: valid / invalid

  • 911.gov - valid
  • 911 - invalid (no TLD)
  • a-.com - invalid
  • -a.com - invalid
  • a.com - valid
  • a.66 - invalid
  • my_host.com - invalid (underscore)
  • typical-hostname33.whatever.co.uk - valid
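The results above can be reproduced with a short Python check. This is just a sketch: the function name is made up, and Python is used only because its re module accepts this regex unchanged.

```python
import re

# The regex from this answer, copied unchanged.
HOSTNAME_RE = re.compile(
    r"(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63}$)"
)


def is_valid_hostname(name):
    """Return True if name matches the hostname regex."""
    return HOSTNAME_RE.match(name) is not None


SAMPLES = [
    "911.gov",
    "911",
    "a-.com",
    "-a.com",
    "a.com",
    "a.66",
    "my_host.com",
    "typical-hostname33.whatever.co.uk",
]

for name in SAMPLES:
    print(name, "-", "valid" if is_valid_hostname(name) else "invalid")
```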

EDIT: John Rix provided an alternative hack of the regex to make the specification of a TLD optional:

(?=^.{1,253}$)(^(((?!-)[a-zA-Z0-9-]{1,63}(?<!-))|((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63})$)
  • 911 - valid
  • 911.gov - valid

EDIT 2: Someone asked for a version that works in JS. It doesn't work there because JS engines without ES2018 support have no regex lookbehind, specifically the (?<!-) construct, which specifies that the previous character cannot be a hyphen.

Anyway, here it is rewritten without the lookbehind; a little uglier, but not much:

(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{0,62}[a-zA-Z0-9]\.)+[a-zA-Z]{2,63}$)

you could likewise make a similar replacement on John Rix's version.
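One way to gain some confidence that the rewrite behaves the same is to run both patterns over the same samples. The sketch below does that in Python, chosen only because it can run both forms; the sample list and helper name are made up, and agreement on a handful of inputs is not a proof of equivalence.

```python
import re

WITH_LOOKBEHIND = re.compile(
    r"(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63}$)"
)
WITHOUT_LOOKBEHIND = re.compile(
    r"(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{0,62}[a-zA-Z0-9]\.)+[a-zA-Z]{2,63}$)"
)

SAMPLES = [
    "911.gov",
    "911",
    "a-.com",
    "-a.com",
    "a.com",
    "a.66",
    "my_host.com",
    "typical-hostname33.whatever.co.uk",
    "a-b.com",
]


def verdicts(pattern):
    """Return one True/False verdict per sample for the given pattern."""
    return [pattern.match(s) is not None for s in SAMPLES]


print(verdicts(WITH_LOOKBEHIND) == verdicts(WITHOUT_LOOKBEHIND))  # True
```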

EDIT 3: if you want to allow trailing dots - which is technically allowed:

(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63}\.?$)

I wasn't familiar with trailing dot syntax until @ChaimKut pointed it out and I did some research.

Using trailing dots, however, seems to cause somewhat unpredictable results in the various tools I played with, so I would advise some caution.

tombolinux

This regex is what you want:

(?=^.{1,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)

It matches your example domains (groupa-zone1appserver.example.com, cod.eu, etc.).

I'll try to explain:

(?=^.{1,254}$) matches domain names (which can begin with any character) that are between 1 and 254 characters long; it could also be 5,254 if we assume co.uk is the minimum length.

(^ starts the match

(?: opens a non-capturing group

(?!\d+\.) a domain level should not consist only of digits, so 1234.co.uk and abc.123.uk aren't accepted, while 1a.ko.uk is.

[a-zA-Z0-9_\-] each domain level should consist only of the characters a-zA-Z0-9_-

{1,63} the length of any domain level is at most 63 characters (it could be 2,63). Note that because the \.? that follows is optional, two repetitions of the group can sit side by side with no dot between them, so this limit is not strictly enforced.

\.?)+ closes the group with an optional dot and repeats it one or more times

(?:[a-zA-Z]{2,})$) the final part of the domain name must be a word of at least 2 characters from a-zA-Z and must not be followed by anything else
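The examples mentioned in this explanation can be checked with a few lines of Python. This is a sketch with a made-up function name; the last sample is an extra observation, not something from the answer: since the dot inside the group is optional, a label longer than 63 characters still gets through.

```python
import re

# The regex from this answer, copied unchanged.
DOMAIN_RE = re.compile(
    r"(?=^.{1,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)"
)


def matches(name):
    """Return True if name matches the regex."""
    return DOMAIN_RE.match(name) is not None


for name in [
    "groupa-zone1appserver.example.com",
    "cod.eu",
    "1a.ko.uk",
    "1234.co.uk",
    "abc.123.uk",
    "a" * 70 + ".com",  # 70-character label
]:
    print(name, matches(name))
```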

CONSIDERATION #1:

Please note that due to relaxed requirements in RFC-2181 DNS labels can consist of pretty much any combination of symbols (however, the length restrictions are still there):

"Any binary string whatever can be used as the label of any resource record. Implementations of the DNS protocols must not place any restrictions on the labels that can be used. In particular, DNS servers must not refuse to serve a zone because it contains labels that might not be acceptable to some DNS client programs." (https://tools.ietf.org/html/rfc2181#section-11)

CONSIDERATION #2:

"There is an additional rule that essentially requires that top-level domain names not be all-numeric" (https://tools.ietf.org/html/rfc3696#section-2)

Taking into account these two considerations, the correct regex looks like this:

/^(?!:\/\/)(?=.{1,255}$)((.{1,63}\.){1,127}(?![0-9]*$)[a-z0-9-]+\.?)$/i

See demo @ http://regexr.com/3g5j0
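For readers who prefer to try it locally, here is the same pattern carried over to Python as a sketch. The /i flag becomes re.IGNORECASE, the function name is made up, and the sample inputs are my own rather than the ones in the linked demo.

```python
import re

# Same pattern as above; the JavaScript /i flag is expressed as re.IGNORECASE.
FQDN_RE = re.compile(
    r"^(?!://)(?=.{1,255}$)((.{1,63}\.){1,127}(?![0-9]*$)[a-z0-9-]+\.?)$",
    re.IGNORECASE,
)


def is_fqdn(name):
    """Return True if name matches the RFC-2181 / RFC-3696 inspired regex."""
    return FQDN_RE.match(name) is not None


for name in [
    "example.com",
    "example.com.",          # trailing dot is accepted
    "my_host.example.com",   # labels may contain any character
    "example.123",           # all-numeric TLD is rejected
    "example",               # no dot-terminated label, so rejected
]:
    print(name, is_fqdn(name))
```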

The following expression

(^((?=^.{4,253}$)(((http){0,1}|(http){0,1}|(ftp){0,1}|(ws){0,1})(s{0,1}):\/\/){0,1})((((?!-)[\pL0-9\-]{1,63})(?<!-)(\.)){1,})(((?!-)[a-z0-9\-]{1,63})(?<!-)((\/{0,1}[\pL\pN?=\-]*)+){1})$)

will match

https://www.tes1t.com/lets/to?878932572
https://www.test.co.uk/lets/to?878932572
http://www.test.com/lets/to?878932572
http://www.test.co.uk/lets/to?878932572
ftp://www.test.com/lets/to?878932572
subdomain.test.com/lets/to?878932572
subdomain.test.com/lets/to?878932572
subdomain.subdomain.test.net/lets/to?878932572

sub-domain.test.net/lets/to?878932572
sub-domain.test.net/lets-go/to?878932572
www.test.net/lets/to?878932572
www.test-test.com/
www.test-test.com

subdomain.subdomainsubdomainsuèdomainsubdomainsubdomainsubdomainsubdomain.net/let2s/to?=878932572

www.test-test.co.uk
http://www.test-test-.com/test
www.test-teèst.co.uk/lets
www.test-test.co.uk/lets/
www.test-test.co.uk/lets/to?
test-test.co.uk/lets/to?
test-test.co.uk/lets/
test-test.co.uk/lets
test-test.co.uk
http://test.com/lets/to?878932572
https://test.com/lets/to?878932572
ftp://test.com/lets/to?878932572
ftps://test.com/lets/to?878932572
ws://test.com/lets/to?878932572aa
wss://test.com/lets/to?=878932572bar
test.com

subdomain.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.test.khbdomainsubdomainsubdomain.test.net/lets/to?87893257

but not match:

www.-test-fail-.com
www.-test-fail.com
-test-fail.com
test-fail-.com

subdomain.subdomainsubdomainsubdomainsubdomainsubdomainsubdomainsubdomainsubdomainsubdomainsubdomainubdomainsubdomainsubdomain.test.net/lets/to?878932572

subdomain.subdomainsubdomainsubdcnvcnvcnofhfhghgfhvnhj-mainsubdomainsubdohhghghghfhgffgjh-gfhfdhfdghmainsubdocgvhngvnbnbmghghghaihgfjgfnfhfdghgsufghgghghhdfjgffsgfbdomainsubdomainsubdomainsubdomainsubdomainsubdomainsubdomain.test.net/lets/to?878932572

subdomain.test.test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test..test.khbdomainsubdomainsubdomain.test.net/lets/to?87893257

We use this regex to validate domains which occur in the wild. It covers all practical use cases I know of. New ones are welcome. According to our guidelines it avoids capturing groups and greedy matching.

^(?!.*?_.*?)(?!(?:[\d\w]+?\.)?\-[\w\d\.\-]*?)(?![\w\d]+?\-\.(?:[\d\w\.\-]+?))(?=[\w\d])(?=[\w\d\.\-]*?\.+[\w\d\.\-]*?)(?![\w\d\.\-]{254})(?!(?:\.?[\w\d\-\.]*?[\w\d\-]{64,}\.)+?)[\w\d\.\-]+?(?<![\w\d\-\.]*?\.[\d]+?)(?<=[\w\d\-]{2,})(?<![\w\d\-]{25})$

Proof and explanation: https://regex101.com/r/FLA9Bv/9

There are two approaches to choose from when validating domains.

By-the-books FQDN matching (theoretical definition, rarely encountered in practice):

Practical / conservative FQDN matching (practical definition, expected and supported in practice):

  • by-the-books matching with the following exceptions/additions
  • valid characters: [a-zA-Z0-9.-]
  • labels cannot start or end with hyphens (as per RFC-952 and RFC-1123/2.1)
  • TLD min length is 2 characters, max length is 24 characters, as per currently existing records
  • don't match trailing dot
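The practical rules listed above can also be written without a regex. The following Python sketch makes a few assumptions of its own: the function name is invented, the 253 and 63 character limits are borrowed from the hostname rules quoted earlier in this thread, at least two labels are required, and an all-numeric TLD is rejected as in the RFC-3696 quote.

```python
import string

# valid characters inside a label: [a-zA-Z0-9-] (the dot is the separator)
ALLOWED = set(string.ascii_letters + string.digits + "-")


def is_practical_fqdn(name):
    """Check name against the practical / conservative rules."""
    if not 1 <= len(name) <= 253:
        return False
    if name.endswith("."):  # don't match trailing dot
        return False
    labels = name.split(".")
    if len(labels) < 2:  # assumption: need at least one label plus a TLD
        return False
    for label in labels:
        if not 1 <= len(label) <= 63:
            return False
        if not set(label) <= ALLOWED:
            return False
        if label[0] == "-" or label[-1] == "-":
            return False
    tld = labels[-1]
    # TLD length 2 to 24, and not all-numeric
    return 2 <= len(tld) <= 24 and not tld.isdigit()


for name in ["example.com", "example.com.", "-a.com", "my_host.com", "a.c"]:
    print(name, is_practical_fqdn(name))
```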