Perl Regex to get the root domain of a URL

不思量自难忘° 2021-01-13 17:40

How can I extract part of a URL?

For example:

http://www.facebook.com/xxxxxxxxxxx
http://www.stackoverflow.com/yyyyyyyyyyyyyyyy

I

6 Answers
  • 2021-01-13 18:00
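    my $url = "http://www.stackoverflow.com/yyyyyyyyyyyyyyyy";
    # Skip the first host label (e.g. "www") and capture up to the next slash.
    # [^/]+ stops at the first slash, so deeper paths are not captured.
    if ($url =~ m{//\w+\.([^/]+)}) {
        print $1;
    }
    else {
        print "false";
    }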
    
  • 2021-01-13 18:01

    I like the URI answer. The OP requested a regex, so in honor of the request and as a challenge, here is the answer I came up with. To be fair, sometimes it is not easy or feasible to install CPAN modules. I have worked on some projects that are hardened to a very specific version of Perl, where only certain modules are allowed.

    Here is my attempt at the regex answer. Note that the www. prefix is optional, and subdomains like mobile. are kept. The host capture stops at the first : or /, so a URL with a port number or with directories on the end is parsed correctly. It does not depend on the protocol; it could be http, https, file, sftp, whatever. The output is captured in $1.

    ^.*://(?:[wW]{3}\.)?([^:/]*).*$
    

    Sample input:

    http://WWW.facebook.com:80/
    http://facebook.com/xxxxxxxxxxx/aaaaa
    http://www.stackoverflow.com/yyyyyyyyyyyyyyyy/aaaaaaa
    https://mobile.yahoo.com/yyyyyyyyyyyyyyyy/aaaaaaa
    http://www.theregister.co.uk/
    

    Sample output:

    facebook.com
    facebook.com
    stackoverflow.com
    mobile.yahoo.com
    theregister.co.uk
    

    EDIT: Thanks @ikegami for the extra challenge. :) Now it supports WWW in any mixed case and a port number like :80.
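
    For anyone who wants to try the pattern, here is a minimal sketch that runs it over a few of the sample inputs. The helper name host_without_www is my own invention for illustration, and the regex is the one above, unchanged. It assumes the string contains a scheme followed by ://, and returns undef otherwise.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper (name made up for this sketch): returns the host
    # with a leading "www." removed, or undef if there is no "scheme://".
    sub host_without_www {
        my ($url) = @_;
        return $url =~ m{^.*://(?:[wW]{3}\.)?([^:/]*).*$} ? $1 : undef;
    }

    for my $url (qw(
        http://WWW.facebook.com:80/
        https://mobile.yahoo.com/yyyyyyyyyyyyyyyy/aaaaaaa
        http://www.theregister.co.uk/
    )) {
        print host_without_www($url), "\n";
    }
    # prints:
    #   facebook.com
    #   mobile.yahoo.com
    #   theregister.co.uk
    ```

    Because the leading .* is greedy, a second :// later in the string (for example inside a query parameter) would shift the match, so this is probably best kept to simple URLs like the samples.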

  • 2021-01-13 18:14

    Just some simple regex stuff.

    my $facebook = "www.facebook.com/xxxxxxxxxxx";
    
    $facebook =~ s/www\.(.*\.com).*/$1/; # get what is between www. and .com
    
    print $facebook;
    

    Returns

    facebook.com
    

    You may also want to make this work for .net, .org, etc. Something like:

    s/www\.(.*\.(?:net|org|com)).*/$1/;
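
    As a rough sketch of that idea, the alternation can be built from a list so that adding a TLD is a one-word change. The names @tlds, $tld_re and strip_www are made up for this example, and the list is only an assumption about which TLDs you care about. It also differs from the one-liner above in two ways: the match is lazy and must be followed by /, : or the end of the string, so a path containing ".com" is not swallowed, and /r (Perl 5.14+) returns a modified copy instead of changing the input.

    ```perl
    use strict;
    use warnings;

    # Hypothetical list of TLDs to accept; extend as needed.
    my @tlds   = qw( com net org );
    my $tld_re = join '|', map { quotemeta($_) } @tlds;

    # Hypothetical helper: drops "www." and anything after the domain.
    # Returns the input unchanged when the pattern does not match.
    sub strip_www {
        my ($host) = @_;
        return $host =~ s{www\.(.*?\.(?:$tld_re))(?=[/:]|\z).*}{$1}r;
    }

    print strip_www('www.facebook.com/xxxxxxxxxxx'), "\n";   # facebook.com
    print strip_www('www.example.org'), "\n";                # example.org
    ```

    A fixed list like this still cannot handle multi-level suffixes such as .co.uk.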
    
  • 2021-01-13 18:16

    I found a way:

    my @urls = qw( http://www.facebook.com http://www.sadas.com/ );
    for my $url (@urls) {
       $url =~ s{^https?://(?:www\.)?}{}i;   # strip the scheme, the "//", and an optional "www."
       $url =~ s{/.*}{};
       print "$url\n";
    }
    
  • 2021-01-13 18:18

    This might be helpful:

    ^https?:\/\/www\.([\da-zA-Z\.-]+)

    Sample Input:

    http://www.banglanews24.com/detailsnews.php?nssl=763daee77dc90b1c1baf0a361be2ff3c&nttl=20130416072403189462
    
    http://www.prothom-alo.com/detail/date/2013-04-20/news/3463
    
    http://www.facebook.com/xxxxxxxxxxx
    
    http://www.stackoverflow.com/yyyyyyyyyyyyyyy
    

    Sample output:

    banglanews24.com
    
    prothom-alo.com
    
    facebook.com
    
    stackoverflow.com
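
    Here is a small sketch of how the pattern might be used from Perl. The helper name domain_after_www is made up for illustration. Note that the pattern requires a literal www. after the scheme, so under that assumption a URL without it yields undef.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper wrapping the pattern above.
    # Returns the captured domain, or undef when the URL does not match.
    sub domain_after_www {
        my ($url) = @_;
        my ($domain) = $url =~ m{^https?://www\.([\da-zA-Z.-]+)};
        return $domain;
    }

    print domain_after_www('http://www.prothom-alo.com/detail/date/2013-04-20/news/3463'), "\n";
    # prints: prothom-alo.com
    ```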
    
  • 2021-01-13 18:24
    use feature qw( say state );
    
    use Domain::PublicSuffix qw( );
    use URI                  qw( );
    
    # Returns "domain.tld" for "subdomain.domain.tld". 
    # Handles multi-level TLDs such as ".co.uk".
    sub root_domain {
       my ($domain) = @_;
       state $parser = Domain::PublicSuffix->new();
       return $parser->get_root_domain($domain);
    }
    
    # Accepts urls as strings and as URI objects.
    sub url_root_domain {
       my ($abs_url) = @_;
       my $domain = URI->new($abs_url)->host();
       return root_domain($domain);
    }
    
    say url_root_domain('http://www.facebook.com/');       # facebook.com
    say url_root_domain('https://www.facebook.com/');      # facebook.com
    say url_root_domain('http://mobile.google.com/');      # google.com
    say url_root_domain('http://www.theregister.co.uk/');  # theregister.co.uk
    say url_root_domain('http://www.com/');                # www.com
    