Question
I have a list of a million URLs. I need to extract the TLD of each URL and create a separate file per TLD. For example, collect all URLs whose TLD is .com and dump them into one file, another file for the .edu TLD, and so on. Furthermore, within each file I have to sort the entries alphabetically by domain and then by subdomain.
Can anyone give me a head start on implementing this in Perl?
Answer 1:
- Use URI to parse the URL,
- use its host method to get the host,
- use Domain::PublicSuffix's get_root_domain method to parse the host name,
- use the tld or suffix method to get the real TLD or the pseudo TLD.
use strict;
use warnings;
use feature qw( say );

use Domain::PublicSuffix qw( );
use URI qw( );

my $dps = Domain::PublicSuffix->new();

for (qw(
    http://www.google.com/
    http://www.google.co.uk/
)) {
    my $url = $_;

    # Treat relative URLs as absolute URLs with missing http://.
    $url = "http://$url" if $url !~ /^\w+:/;

    my $host = URI->new($url)->host();
    $host =~ s/\.\z//;  # D::PS doesn't handle "domain.com.".

    $dps->get_root_domain($host)
        or die $dps->error();

    say $dps->tld();     # com uk
    say $dps->suffix();  # com co.uk
}
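
The snippet above only extracts the TLD; to cover the rest of the question (one output file per TLD, sorted by domain and then by subdomain), here is a minimal sketch along the same lines. The input format (one URL per line, read from STDIN or from files named on the command line) and the output file names ("com.txt" and so on) are assumptions for illustration. Sorting on the reversed host labels ("www.google.com" becomes "com google www") gives the domain-first, subdomain-second order the question asks for.

use strict;
use warnings;
use feature qw( say );

use Domain::PublicSuffix qw( );
use URI qw( );

my $dps = Domain::PublicSuffix->new();

# Hosts bucketed by TLD; each entry carries a sort key built from
# the reversed host labels so comparison runs domain-first.
my %by_tld;

while (my $url = <>) {          # e.g. perl bucket.pl urls.txt
    chomp $url;
    next unless length $url;
    $url = "http://$url" if $url !~ /^\w+:/;

    my $host = eval { URI->new($url)->host() }
        or next;                # skip lines with no parsable host
    $host =~ s/\.\z//;

    $dps->get_root_domain($host)
        or next;                # skip hosts with no known suffix
    my $tld = $dps->tld();

    my $key = join ' ', reverse split /\./, $host;
    push @{ $by_tld{$tld} }, [ $key, $host ];
}

# One file per TLD (the "$tld.txt" naming is an assumption),
# written in domain-then-subdomain order.
for my $tld (sort keys %by_tld) {
    open my $fh, '>', "$tld.txt" or die "Can't write $tld.txt: $!";
    say {$fh} $_->[1]
        for sort { $a->[0] cmp $b->[0] } @{ $by_tld{$tld} };
    close $fh or die "Can't close $tld.txt: $!";
}

A million hosts fit comfortably in memory this way; for much larger inputs you could instead append to unsorted per-TLD files in a first pass and sort each file separately in a second.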
Source: https://stackoverflow.com/questions/8031620/extraction-of-tld-from-urls-and-sorting-domains-and-subdomains-for-each-tld-file