I have some domains I want to split but can't figure out the regex...
I have:
http://www.google.com/tomato
http://int.google.com
$res = preg_replace( "/^(http:\/\/)([a-z_\-]+\.)*([a-z_\-]+)\.(com|co\.uk|net)(\/.*)?$/im", "\$3", $in );
Add as many endings as you know.
Edit: made a mistake :-(
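For context, a rough sketch of how that pattern is meant to be applied (the $in string below is an assumed example; the /m flag makes ^ and $ work per line):

$in = "http://www.google.com/tomato\nhttp://int.google.com";
// replace each whole line with its captured domain name ($3)
$res = preg_replace( "/^(http:\/\/)([a-z_\-]+\.)*([a-z_\-]+)\.(com|co\.uk|net)(\/.*)?$/im", "\$3", $in );
echo $res; // "google" on both lines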
Why are you trying to use regex? There are plenty of native functions available to you, such as:
$host = parse_url($url, PHP_URL_HOST);
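For instance, a quick sketch with the question's URLs (the loop is just for illustration):

foreach (array('http://www.google.com/tomato', 'http://int.google.com') as $url) {
    // PHP_URL_HOST returns only the host part of the URL
    echo parse_url($url, PHP_URL_HOST) . "\n"; // www.google.com, then int.google.com
}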
Update: give this a go. It may need improving, but it's better than regex IMO.
function determainDomainName($url)
{
    $hostname = parse_url($url, PHP_URL_HOST);
    $parts = explode(".", $hostname);
    switch (count($parts))
    {
        case 1:
            return $parts[0]; // single segment, nothing to strip
        case 2:
            return $parts[0]; // plain domain.tld, e.g. google.com
        case 3:
            if ($parts[0] == "www") // the most common subdomain
            {
                return $parts[1]; // bypass subdomain / return next segment
            }
            if ($parts[1] == "co") // possible in_array here for multiples, but first segment of a double-barrel TLD
            {
                return $parts[0]; // bypass double-barrel TLDs
            }
            // otherwise fall through and guess
        default:
            // Have a guess:
            // I bet the longest word is the domain :)
            usort($parts, "mysort");
            return $parts[0];
            /*
                Here we just order the array by the longest word,
                so google will always come above the following:
                com, co, uk, www, cdn, ww1, ww2 etc.
            */
    }
}
function mysort($a, $b)
{
    return strlen($b) - strlen($a);
}
Add the two functions above to your library, then use them like so:
$urls = array(
    'http://www.google.com/tomato',
    'http://int.google.com',
    'http://google.co.uk'
);
foreach ($urls as $url)
{
    echo determainDomainName($url) . "\n";
}
They will all echo google
see @ http://codepad.org/pA5KWckb
You can do this on a best-bet basis. The last part of the URL is always the TLD (and an optional trailing root dot). And you are basically looking for any preceding word that is longer than 2 letters:
$url = "http://www.google.co.uk./search?q=..";
preg_match("#http://
(?:[^/]+\.)* # cut off any preceeding www*
([\w-]{3,}) # main domain name
(\.\w\w)? # two-letter second level domain .co
\.\w+\.? # TLD
(/|:|$) # end regex with / or : or string end
#x",
$url, $match);
If you expect any longer second-level domains (.com, maybe?) then add another \w. But this is not very generic; you would actually need a list of TLDs where this is allowed.
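A quick usage sketch (the loop and echo are my own addition; the pattern is the one above collapsed onto one line), showing that $match[1] ends up holding the bare domain name:

foreach (array("http://www.google.co.uk./search?q=..", "http://int.google.com") as $url) {
    if (preg_match('~http://(?:[^/]+\.)*([\w-]{3,})(\.\w\w)?\.\w+\.?(/|:|$)~', $url, $match)) {
        echo $match[1] . "\n"; // google, both times
    }
}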
The answer here might be what you're looking for.
Getting parts of a URL (Regex)