Extracting the TLD from URLs and sorting domains and subdomains into per-TLD files

Asked by 南笙 on 2021-01-15 18:54

I have a list of a million URLs. I need to extract the TLD of each URL and create a separate file per TLD. For example, collect all the URLs whose TLD is .com and dump them into one file.

1 Answer
  • Answered 2021-01-15 19:11
    1. Use URI to parse the URL.
    2. Use its host method to get the host.
    3. Use Domain::PublicSuffix's get_root_domain to parse the host name.
    4. Use the tld or suffix method to get the real TLD or the pseudo-TLD, respectively.

    use strict;
    use warnings;
    use feature qw( say );
    
    use Domain::PublicSuffix qw( );
    use URI                  qw( );
    
    my $dps = Domain::PublicSuffix->new();
    
    for (qw(
       http://www.google.com/
       http://www.google.co.uk/
    )) {
       my $url = $_;
    
       # Treat relative URLs as absolute URLs with missing http://.
       $url = "http://$url" if $url !~ /^\w+:/;
    
       my $host = URI->new($url)->host();
       $host =~ s/\.\z//;  # D::PS doesn't handle "domain.com.".
    
       $dps->get_root_domain($host)
          or die $dps->error();
    
       say $dps->tld();     # com  uk
       say $dps->suffix();  # com  co.uk
    }
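
    To actually split the list into one file per TLD, as the question asks, here is a minimal sketch built on the same approach. It assumes one URL per line on standard input; the %fh filehandle cache and the "<tld>.txt" output names are illustrative choices, not part of Domain::PublicSuffix.

    use strict;
    use warnings;
    use feature qw( say );

    use Domain::PublicSuffix qw( );
    use URI                  qw( );

    my $dps = Domain::PublicSuffix->new();

    my %fh;  # One output filehandle per TLD, opened lazily.

    while (my $url = <STDIN>) {
       chomp($url);
       next if !length($url);

       # Treat relative URLs as absolute URLs with missing http://.
       $url = "http://$url" if $url !~ /^\w+:/;

       my $host = eval { URI->new($url)->host() }
          or next;          # Skip URLs without a usable host.
       $host =~ s/\.\z//;   # D::PS doesn't handle "domain.com.".

       $dps->get_root_domain($host)
          or next;          # Skip hosts the Public Suffix List rejects.

       my $tld = $dps->tld();

       # Open "<tld>.txt" on first use and cache the handle.
       $fh{$tld} //= do {
          open(my $out, '>', "$tld.txt")
             or die("Can't create $tld.txt: $!\n");
          $out;
       };

       say { $fh{$tld} } $url;
    }

    Invoked as, say, perl split_by_tld.pl < urls.txt, this makes a single pass over the input. Since there are only on the order of 1,500 TLDs, keeping one open filehandle per TLD stays well within typical file-descriptor limits.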
    