How can I extract URL and link text from HTML in Perl?

一生所求 2020-11-27 17:19

I previously asked how to do this in Groovy. However, now I'm rewriting my app in Perl because of all the CPAN libraries.

If the page contained these links:

<
11 Answers
  • 2020-11-27 17:50

    Sherm recommended HTML::LinkExtor, which is almost what you want. Unfortunately, it can't return the text inside the <a> tag.

    Andy recommended WWW::Mechanize. That's probably the best solution.

    If you find that WWW::Mechanize isn't to your liking, try HTML::TreeBuilder. It will build a DOM-like tree out of the HTML, which you can then search for the links you want and extract any nearby content you want.
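
    A minimal sketch of the HTML::TreeBuilder approach, assuming the page source is already in a string. The sample markup and the variable names ($html, @links) are made up for illustration; look_down, attr and as_text are the HTML::Element methods doing the work.

    ```perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    # Sample markup, assumed for illustration only.
    my $html = q{<p><a href="http://www.google.com">Google</a></p>}
             . q{<p><a href="http://www.apple.com">Apple</a></p>};

    my $tree = HTML::TreeBuilder->new_from_content($html);

    # look_down returns every element that satisfies all criteria:
    # here, <a> elements that actually carry an href attribute.
    my @links = map { [ $_->attr('href'), $_->as_text ] }
                $tree->look_down( _tag => 'a', sub { defined $_[0]->attr('href') } );

    print "$_->[0]\t$_->[1]\n" for @links;

    $tree->delete;    # release the tree once the data has been copied out
    ```

    Each entry of @links is a [url, text] pair, so the same loop can feed whatever the rest of the app needs.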

  • 2020-11-27 17:53

    Have a look at HTML::LinkExtractor and HTML::LinkExtor; the latter is part of the HTML::Parser distribution.

    HTML::LinkExtractor is similar to HTML::LinkExtor, except that besides the URL you also get the link text.

  • 2020-11-27 17:54

    Or consider enhancing HTML::LinkExtor to do what you want, and submitting the changes to the author.
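
    To give an idea of what such an enhancement has to do, here is a rough sketch that skips HTML::LinkExtor and uses HTML::Parser's event handlers directly to pair each href with the text inside the tag. This is not HTML::LinkExtor's own code; the names @links and $current and the sample markup are assumptions for illustration.

    ```perl
    use strict;
    use warnings;
    use HTML::Parser;

    my @links;      # finished { url => ..., text => ... } records
    my $current;    # record for the <a> currently open, if any

    my $parser = HTML::Parser->new(
        api_version => 3,
        # On <a href="...">, start a new record.
        start_h => [ sub {
            my ($tag, $attr) = @_;
            $current = { url => $attr->{href}, text => '' }
                if $tag eq 'a' && defined $attr->{href};
        }, 'tagname, attr' ],
        # While inside a link, accumulate the entity-decoded text.
        text_h => [ sub { $current->{text} .= $_[0] if $current }, 'dtext' ],
        # On </a>, store the finished record.
        end_h => [ sub {
            return unless $_[0] eq 'a' && $current;
            push @links, $current;
            undef $current;
        }, 'tagname' ],
    );

    # Sample markup, assumed for illustration only.
    $parser->parse(q{<a href="http://www.google.com">Google</a> <a href="http://www.apple.com">Apple</a>});
    $parser->eof;

    print "$_->{url}\t$_->{text}\n" for @links;
    ```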

  • 2020-11-27 17:57

    Another way to do this is to use XPath to query the parsed HTML. This is useful in more complex cases, such as extracting all links inside a div with a specific class. Use HTML::TreeBuilder::XPath for this.

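      use HTML::TreeBuilder::XPath;

      # $c holds the HTML source of the page
      my $tree = HTML::TreeBuilder::XPath->new_from_content($c);
      # in list context, findnodes returns the matching nodes directly
      for my $node ($tree->findnodes(q{//map[@name='map1']/area})) {
        my $t = $node->attr('title');   # any attribute works, e.g. 'href'
      }
      $tree->delete;   # free the tree when done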
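
    For the "all links in a div with a specific class" case mentioned above, a sketch might look like this. The class name nav and the sample markup are assumptions; note that @class='nav' only matches when the attribute is exactly nav.

    ```perl
    use strict;
    use warnings;
    use HTML::TreeBuilder::XPath;

    # Sample markup, assumed for illustration only.
    my $html = q{<div class="nav"><a href="/home">Home</a><a href="/about">About</a></div>}
             . q{<div class="footer"><a href="/legal">Legal</a></div>};

    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

    # Select only <a href> elements that sit somewhere below <div class="nav">.
    my @links = map { [ $_->attr('href'), $_->as_text ] }
                $tree->findnodes(q{//div[@class='nav']//a[@href]});

    print "$_->[0]\t$_->[1]\n" for @links;

    $tree->delete;
    ```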
    
  • 2020-11-27 17:58

    We can also use a regular expression to extract each link together with its link text. This is one more way to do it.

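    use strict;
    use warnings;

    # Slurp all of DATA; $/ = '' would mean paragraph mode and stop at the first blank line
    my $html = do { local $/; <DATA> };
    
    # Capture the href value ($1) and the text between <a ...> and </a> ($2)
    while( $html =~ m{<a\b[^>]*?href="([^"]*)"[^>]*>\s*(.*?)\s*</a>}igs )
    {   
        print "Link:$1 \t Text: $2\n";
    }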
    
    
    __DATA__
    
    <a href="http://www.google.com">Google</a>
    
    <a href="http://www.apple.com">Apple</a>
    