How can I extract URL and link text from HTML in Perl?

Backend · 11 answers · 1622 views
一生所求, asked 2020-11-27 17:19

I previously asked how to do this in Groovy. However, now I'm rewriting my app in Perl because of all the CPAN libraries.

If the page contained these links:

11 Answers
  • 2020-11-27 17:34

    Please look at using the WWW::Mechanize module for this. It will fetch your web pages for you, and then give you an easy-to-work-with list of URLs.

    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    $mech->get( $some_url );
    my @links = $mech->links();
    for my $link ( @links ) {
        printf "%s, %s\n", $link->text, $link->url;
    }
    
    

    Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.

    Mech is basically a browser in an object.
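    As a sketch of the "even simpler" navigation mentioned above (the URL and the link-text pattern are placeholders, not from the original answer):

        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();
        $mech->get( 'http://example.com/' );           # placeholder URL

        # Follow the first link whose text matches, then keep working
        # with the new page through the same object.
        $mech->follow_link( text_regex => qr/next/i );
        print $mech->uri, "\n";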

  • 2020-11-27 17:35

    Previous answers were perfectly good and I know I’m late to the party but this got bumped in the [perl] feed so…

    XML::LibXML is excellent for HTML parsing and unbeatable for speed. Set the recover option when parsing badly formed HTML.

    use XML::LibXML;
    
    my $doc = XML::LibXML->load_html(IO => \*DATA);
    for my $anchor ( $doc->findnodes("//a[\@href]") )
    {
        printf "%15s -> %s\n",
            $anchor->textContent,
            $anchor->getAttribute("href");
    }
    
    __DATA__
    <html><head><title/></head><body>
    <a href="http://www.google.com">Google</a>
    <a href="http://www.apple.com">Apple</a>
    </body></html>
    

    which yields:

         Google -> http://www.google.com
          Apple -> http://www.apple.com
    
  • 2020-11-27 17:40

    HTML is a structured markup language that has to be parsed to extract its meaning without errors. The module Sherm listed will parse the HTML and extract the links for you. Ad hoc regular expression-based solutions might be acceptable if you know that your inputs will always be formed the same way (don't forget attributes), but a parser is almost always the right answer for processing structured text.
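    To make the failure mode concrete, here is an illustrative sketch (the markup and regex are hypothetical, not from the original answer) of a perfectly valid anchor that a line-oriented, double-quote-only regex will silently miss:

        use strict;
        use warnings;

        # Valid HTML: single-quoted attribute, tag split across two lines.
        my $html = qq{<a class='ext'\n   href='http://example.com/'>Example</a>};

        # A naive pattern that assumes double quotes and a single line:
        if ( $html =~ /<a.*href="([^"]+)"/ ) {
            print "regex found: $1\n";
        }
        else {
            print "regex found nothing\n";   # this branch runs
        }

    A real parser handles both quoting styles, attribute order, and line breaks without any special cases.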

  • 2020-11-27 17:42

    If you're adventurous and want to try without modules, something like this should work (adapt it to your needs):

    #!/usr/bin/perl
    use strict;
    use warnings;
    
    if (@ARGV < 1) {
        print "$0: Need URL argument.\n";
        exit 1;
    }
    
    my @content = split /\n/, `wget -qO- $ARGV[0]`;
    my @links   = grep { /<a.*href=.*>/ } @content;
    
    foreach my $c (@links) {
        my ($link)  = $c =~ /<a.*href="([\s\S]+?)".*>/;
        my ($title) = $c =~ /<a.*href.*>([\s\S]+?)<\/a>/;
        print "$title, $link\n" if defined $link && defined $title;
    }
    

    There are likely a few things I did wrong here, but it works in a handful of test cases I tried after writing it (it doesn't account for things like <img> tags inside anchors, etc.).

  • 2020-11-27 17:42

    I like using pQuery for things like this...

    use 5.010;    # enables say
    use pQuery;
    
    pQuery( 'http://www.perlbuzz.com' )->find( 'a' )->each(
        sub {
            say $_->innerHTML . q{, } . $_->getAttribute( 'href' );
        }
    );
    

    Also check out this previous stackoverflow.com question, Emulation of lex-like functionality in Perl or Python, for similar answers.

  • 2020-11-27 17:48

    HTML::LinkExtractor is better than HTML::LinkExtor: it can give you both the link text and the URL.

    Usage:

    use HTML::LinkExtractor;

    my $input = q{If <a href="http://apple.com/"> Apple </a>};  # HTML string
    my $LX = HTML::LinkExtractor->new( undef, undef, 1 );
    $LX->parse( \$input );
    for my $Link ( @{ $LX->links } ) {
        if ( $$Link{_TEXT} =~ m/Apple/ ) {
            print "\n LinkText $$Link{_TEXT} URL $$Link{href}\n";
        }
    }
    