I need to build a function which parses the domain from a URL.
So, with
http://google.com/dhasjkdas/sadsdds/sdda/sdads.html
or
This will generally work very well if the input URL is not total junk. It removes the subdomain.
$host = parse_url( $Row->url, PHP_URL_HOST );
$parts = explode( '.', $host );
$parts = array_reverse( $parts );
$domain = $parts[1].'.'.$parts[0];
Example
Input: http://www2.website.com:8080/some/file/structure?some=parameters
Output: website.com
$domain = str_ireplace('www.', '', parse_url($url, PHP_URL_HOST));
This would return the google.com
for both http://google.com/... and http://www.google.com/...
The code that was meant to work 100% didn't seem to cut it for me, I did patch the example a little but found code that wasn't helping and problems with it. so I changed it out to a couple of functions (to save asking for the list from Mozilla all the time, and removing the cache system). This has been tested against a set of 1000 URLs and seemed to work.
function domain($url)
{
global $subtlds;
$slds = "";
$url = strtolower($url);
$host = parse_url('http://'.$url,PHP_URL_HOST);
preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
foreach($subtlds as $sub){
if (preg_match('/\.'.preg_quote($sub).'$/', $host, $xyz)){
preg_match("/[^\.\/]+\.[^\.\/]+\.[^\.\/]+$/", $host, $matches);
}
}
return @$matches[0];
}
function get_tlds() {
$address = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';
$content = file($address);
foreach ($content as $num => $line) {
$line = trim($line);
if($line == '') continue;
if(@substr($line[0], 0, 2) == '/') continue;
$line = @preg_replace("/[^a-zA-Z0-9\.]/", '', $line);
if($line == '') continue; //$line = '.'.$line;
if(@$line[0] == '.') $line = substr($line, 1);
if(!strstr($line, '.')) continue;
$subtlds[] = $line;
//echo "{$num}: '{$line}'"; echo "<br>";
}
$subtlds = array_merge(array(
'co.uk', 'me.uk', 'net.uk', 'org.uk', 'sch.uk', 'ac.uk',
'gov.uk', 'nhs.uk', 'police.uk', 'mod.uk', 'asn.au', 'com.au',
'net.au', 'id.au', 'org.au', 'edu.au', 'gov.au', 'csiro.au'
), $subtlds);
$subtlds = array_unique($subtlds);
return $subtlds;
}
Then use it like
$subtlds = get_tlds();
echo domain('www.example.com') //outputs: example.com
echo domain('www.example.uk.com') //outputs: example.uk.com
echo domain('www.example.fr') //outputs: example.fr
I know I should have turned this into a class, but didn't have time.
I've found that @philfreo's solution (referenced from php.net) is pretty well to get fine result but in some cases it shows php's "notice" and "Strict Standards" message. Here a fixed version of this code.
function getHost($url) {
$parseUrl = parse_url(trim($url));
if(isset($parseUrl['host']))
{
$host = $parseUrl['host'];
}
else
{
$path = explode('/', $parseUrl['path']);
$host = $path[0];
}
return trim($host);
}
echo getHost("http://example.com/anything.html"); // example.com
echo getHost("http://www.example.net/directory/post.php"); // www.example.net
echo getHost("https://example.co.uk"); // example.co.uk
echo getHost("www.example.net"); // example.net
echo getHost("subdomain.example.net/anything"); // subdomain.example.net
echo getHost("example.net"); // example.net
function get_domain($url = SITE_URL)
{
preg_match("/[a-z0-9\-]{1,63}\.[a-z\.]{2,6}$/", parse_url($url, PHP_URL_HOST), $_domain_tld);
return $_domain_tld[0];
}
get_domain('http://www.cdl.gr'); //cdl.gr
get_domain('http://cdl.gr'); //cdl.gr
get_domain('http://www2.cdl.gr'); //cdl.gr
parse_url didn't work for me. It only returned the path. Switching to basics using php5.3+:
$url = str_replace('http://', '', strtolower( $s->website));
if (strpos($url, '/')) $url = strstr($url, '/', true);