How can I extract URL and link text from HTML in Perl?


I previously asked how to do this in Groovy. However, I'm now rewriting my app in Perl because of all the CPAN libraries.

If the page contained these links:


11 Answers

温柔的废话

2020-11-27 17:50

Sherm recommended HTML::LinkExtor, which is almost what you want. Unfortunately, it can't return the text inside the <a> tag.

Andy recommended WWW::Mechanize. That's probably the best solution.
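As a rough illustration of the WWW::Mechanize route, here is a minimal sketch. The sample page and its two links are made up for the example, and the page is written to a temporary file only so that the script runs without network access; in real use you would pass an http URL to `get`.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# Hypothetical sample page, saved to a temp file so no network is needed.
my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} <<'HTML';
<html><body>
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
</body></html>
HTML
close $fh;

my $mech = WWW::Mechanize->new;
$mech->get( URI::file->new_abs($path) );   # file:// URL of the temp file

# find_all_links returns WWW::Mechanize::Link objects; keep only <a> tags.
my @pairs = map { [ $_->url, $_->text ] } $mech->find_all_links( tag => 'a' );
printf "%s\t%s\n", @$_ for @pairs;
```

Each link object also offers `url_abs` if you need the URL resolved against the page's base.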

If you find that WWW::Mechanize isn't to your liking, try HTML::TreeBuilder. It will build a DOM-like tree out of the HTML, which you can then search for the links you want and extract any nearby content you want.
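A minimal sketch of the HTML::TreeBuilder approach might look like the following. The HTML snippet is invented for the example; `look_down` collects every `<a>` element that has an `href`, and `as_trimmed_text` returns the text inside it, including text nested in child tags.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input; in practice this would be the fetched page source.
my $html = <<'HTML';
<p><a href="http://www.google.com">Google</a>
<a href="http://www.apple.com"><b>Apple</b> Inc.</a></p>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Find every <a> element that actually carries an href attribute.
my @links;
for my $anchor ( $tree->look_down( _tag => 'a', sub { defined $_[0]->attr('href') } ) ) {
    push @links, [ $anchor->attr('href'), $anchor->as_trimmed_text ];
}
$tree->delete;   # release the tree once we are done with it

print "$_->[0]\t$_->[1]\n" for @links;
```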

故里飘歌

2020-11-27 17:53

Have a look at HTML::LinkExtractor and HTML::LinkExtor. HTML::LinkExtor ships with the HTML::Parser distribution; HTML::LinkExtractor is a separate CPAN module.

HTML::LinkExtractor is similar to HTML::LinkExtor, except that besides getting the URL, you also get the link-text.
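For illustration, a short sketch of HTML::LinkExtractor follows. It is written from my reading of the module's documentation, so treat the details as assumptions to check against the POD: `parse` is given a reference to the HTML string, `links` returns an array reference of hashes, and `strip(1)` asks for `_TEXT` as plain text rather than the raw markup. The input string is made up.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Hypothetical input with two links.
my $html = q{<a href="http://www.google.com">Google</a> }
         . q{<a href="http://www.apple.com">Apple</a>};

my $lx = HTML::LinkExtractor->new;
$lx->strip(1);          # assumption: makes _TEXT hold the text without tags
$lx->parse( \$html );   # a scalar reference is parsed as a string of HTML

for my $link ( @{ $lx->links } ) {
    next unless $link->{tag} eq 'a';   # skip img, link, and other tags
    print "$link->{href}\t$link->{_TEXT}\n";
}
```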

日久生厌

2020-11-27 17:54

Or consider enhancing HTML::LinkExtor to do what you want, and submitting the changes to the author.

忘掉有多难

2020-11-27 17:57
Another way to do this is to query the parsed HTML with XPath. This helps in more complex cases, such as extracting all links inside a div with a specific class. Use HTML::TreeBuilder::XPath for this.
```
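  use HTML::TreeBuilder::XPath;

  # $c holds the HTML source as a string
  my $tree = HTML::TreeBuilder::XPath->new_from_content($c);
  # select every <area> inside the image map named "map1"
  my $nodes = $tree->findnodes(q{//map[@name='map1']/area});
  while (my $node = $nodes->shift) {
    my $t = $node->attr('title');   # title attribute of the current node
  }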
```
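Applied to the original question, the same technique could be sketched as below. The class name `nav` and the sample markup are hypothetical; the XPath expression selects only `<a>` elements with an `href` that sit inside a div of that class.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical page fragment with two divs; only "nav" should match.
my $html = <<'HTML';
<div class="nav"><a href="/home">Home</a> <a href="/about">About</a></div>
<div class="footer"><a href="/legal">Legal</a></div>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# In list context findnodes returns the matching element nodes.
my @links = map { [ $_->attr('href'), $_->as_trimmed_text ] }
            $tree->findnodes(q{//div[@class='nav']//a[@href]});
$tree->delete;

print "$_->[0]\t$_->[1]\n" for @links;
```

Note that `@class='nav'` matches only when the attribute is exactly `nav`; a div with several classes would need a `contains()` test instead.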

滥情空心

2020-11-27 17:58

A regular expression can also extract each link together with its link text. This is one more way to do it.

local $/;               # slurp mode: read all of DATA in one go
my $html = <DATA>;

# capture the href value and the text between <a ...> and </a>
while ( $html =~ m/<a[^>]*?href="([^"]*)"[^>]*>\s*(.*?)\s*<\/a>/igs )
{
    print "Link:$1 \t Text: $2\n";
}


__DATA__

<a href="http://www.google.com">Google</a>

<a href="http://www.apple.com">Apple</a>
