Need regex to get domain + subdomain

Question

So I'm using this function here:

function get_domain($url)
{
  $pieces = parse_url($url);
  $domain = isset($pieces['host']) ? $pieces['host'] : '';
  if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
    return $regs['domain'];
  }
  return false;
}

$referer = get_domain($_SERVER['HTTP_REFERER']);

What I need is another regex for it, if someone would be so kind as to help. Specifically, I need it to return the whole domain, including subdomains.

Here's the real problem I have now. When bloggers link from, for example, myblog.blogger.com, the function returns just blogger.com for the referer, which is not ideal.

So if someone could give me a regex for the function above that keeps the subdomain, I'd appreciate it a lot!

Thanks!

Answer 1:

This regex should match a domain in a string, including any subdomains:

/([a-z0-9-]+\.)*[a-z0-9-]+\.[a-z]+/

Translated to rough English, it works like this: match the first part of the string that looks like 'sometextornumbers.sometext', and also include any number of 'sometextornumbers.' parts that might precede it.

See it in action here: http://regexr.com?2vppk

Note that the multiline and global flags in that link are only there so the regex can match the entire blob of test text; you don't need them if you're passing only one line to the regex.
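
To plug this into the question's function, something like the following might work. It's only a sketch: `get_full_domain` is a made-up name, the character classes are written as `[a-z0-9-]` (inside brackets a `|` would match a literal pipe character), and the `^`, `$` and `i` flag are my additions, on the assumption that `parse_url()` has already isolated the host.

```php
<?php
// Hypothetical variant of get_domain() that keeps subdomains.
function get_full_domain($url)
{
    $pieces = parse_url($url);
    // parse_url() returns false on seriously malformed URLs, so check for an array first.
    $host = (is_array($pieces) && isset($pieces['host'])) ? $pieces['host'] : '';
    // Any number of "label." parts, then "label.tld"; anchored because $host is the whole host.
    if (preg_match('/^([a-z0-9-]+\.)*[a-z0-9-]+\.[a-z]+$/i', $host, $regs)) {
        return strtolower($regs[0]);
    }
    return false;
}

var_dump(get_full_domain('http://myblog.blogger.com/2012/01/post.html'));
var_dump(get_full_domain('not a url'));
```

The first call should return myblog.blogger.com rather than just blogger.com, and the second should return false because `parse_url()` finds no host in it.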

Answer 2:

Good luck with the above, as domain names can now contain non-Roman characters. These would have to be converted into an equivalent, unique ASCII form before a regex could work reliably. See RFC 3490, Internationalizing Domain Names in Applications (IDNA), at https://tools.ietf.org/html/rfc3490, which says:

Until now, there has been no standard method for domain names to use
characters outside the ASCII repertoire. This document defines
internationalized domain names (IDNs) and a mechanism called
Internationalizing Domain Names in Applications (IDNA) for handling
them in a standard fashion. IDNs use characters drawn from a large
repertoire (Unicode), but IDNA allows the non-ASCII characters to be
represented using only the ASCII characters already allowed in so-
called host names today. This backward-compatible representation is
required in existing protocols like DNS, so that IDNs can be
introduced with no changes to the existing infrastructure. IDNA is
only meant for processing domain names, not free text.
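
In PHP, one way to do that conversion before applying a regex might be `idn_to_ascii()` from the intl extension. The sketch below assumes intl is installed; `to_ascii_host` is a made-up helper name, and it returns false when the extension is missing or the conversion fails.

```php
<?php
// Hypothetical helper: convert an internationalized host name to its ASCII form.
// Assumes the intl extension is loaded and this file is saved as UTF-8.
function to_ascii_host($host)
{
    if (!function_exists('idn_to_ascii')) {
        return false; // intl extension not available
    }
    // The UTS #46 variant is the one current PHP versions use by default.
    return idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
}

var_dump(to_ascii_host('bücher.example'));
```

With intl available, the host should come back in an ASCII-only "xn--" form that an ASCII regex can at least see. Note that a pattern whose last part only allows letters would still reject internationalized top-level domains, since their ASCII form contains digits and hyphens.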

Answer 3:

Better solution:

/^([a-z0-9-]+[a-z0-9]{1,}\.)*[a-z0-9-]+[a-z0-9]{1,}\.[a-z]{2,}$/

Regex sample: https://regexr.com/4k71a

And for an email address:

/^[a-z0-9.-]+[a-z0-9]{1,}@([a-z0-9-]+[a-z0-9]{1,}\.)*[a-z0-9-]+[a-z0-9]{1,}\.[a-z]{2,}$/
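
Since both patterns are anchored with `^` and `$`, they validate a whole string rather than pick a domain out of a longer one. A rough PHP sketch of how they might be used follows; the function names are made up, the character classes are written without `|` (which would match a literal pipe inside brackets), and the `i` flag is my addition so that mixed-case input is accepted.

```php
<?php
// Hypothetical wrappers around the two anchored patterns.
function is_valid_domain($s)
{
    // Each label must end in a letter or digit; the TLD is at least two letters.
    return preg_match('/^([a-z0-9-]+[a-z0-9]{1,}\.)*[a-z0-9-]+[a-z0-9]{1,}\.[a-z]{2,}$/i', $s) === 1;
}

function is_valid_email($s)
{
    // Same domain part, preceded by a local part and "@".
    return preg_match('/^[a-z0-9.-]+[a-z0-9]{1,}@([a-z0-9-]+[a-z0-9]{1,}\.)*[a-z0-9-]+[a-z0-9]{1,}\.[a-z]{2,}$/i', $s) === 1;
}

var_dump(is_valid_domain('myblog.blogger.com'));
var_dump(is_valid_domain('myblog-.blogger.com'));
var_dump(is_valid_email('user@myblog.blogger.com'));
```

One thing to watch: as written, every label needs at least two characters, so a host like a.example.com is rejected.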

Source: https://stackoverflow.com/questions/8959765/need-regex-to-get-domain-subdomain

Tags

regex

dns

subdomain