How do I parse an HTML website using Perl? [closed]

Submitted by 谁说我不能喝 on 2021-02-04 16:20:07

Question


Could you please give me some suggestions on how to parse HTML in Perl? I plan to parse the keywords (including URL links) and save them to a MySQL database. I am using Windows XP.

Also, do I first need to download the website's pages to the local hard drive with some offline download tool? If so, could you point me to a good one?


Answer 1:


You can use LWP to retrieve the pages you need to parse. There are several ways to parse the HTML itself: you can use regular expressions to find links and keywords (though that is generally poor practice for HTML), or proper parsing modules such as HTML::TokeParser or HTML::TreeBuilder.
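As a minimal sketch of the whole fetch-parse-store pipeline the question describes: the URL is a placeholder, and the DSN, credentials, and links table are hypothetical, so adjust them to your own MySQL schema.

#!/usr/bin/env perl
use strict;
use warnings;

use DBI;
use HTML::TreeBuilder;
use LWP::UserAgent;

# Fetch the page (example.com is a placeholder URL)
my $ua  = LWP::UserAgent->new;
my $res = $ua->get('http://example.com/');
die 'Fetch failed: ', $res->status_line unless $res->is_success;

# Build a parse tree from the downloaded HTML
my $tree = HTML::TreeBuilder->new_from_content($res->decoded_content);

# Hypothetical database and table; adjust DSN, credentials and schema
my $dbh = DBI->connect('DBI:mysql:database=scrape;host=localhost',
    'user', 'password', { RaiseError => 1 });
my $sth = $dbh->prepare('INSERT INTO links (text, href) VALUES (?, ?)');

# Walk every <a> element and store its text and target
for my $a ($tree->look_down(_tag => 'a')) {
    my $href = $a->attr('href') or next;
    $sth->execute($a->as_text, $href);
}

$tree->delete;    # HTML::TreeBuilder trees should be freed explicitly
$dbh->disconnect;

HTML::TokeParser works at a lower level, handing you a stream of tags rather than a tree, which can be simpler and faster for plain keyword extraction.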




Answer 2:


You can use one of many HTML parser modules. If you're familiar with jQuery, the pQuery module would be a good choice, as it ports most of the easy-to-use features of jQuery to Perl for HTML parsing and scraping.
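For example, a small sketch that lists every link on a page (the URL is a placeholder; inside each's callback, pQuery sets $_ to the current pQuery::DOM node):

#!/usr/bin/env perl
use strict;
use warnings;

use pQuery;

# Fetch and parse in one step, then select all anchors jQuery-style
pQuery('http://example.com/')
    ->find('a')
    ->each(sub {
        my $i = shift;                        # zero-based index
        my $href = $_->getAttribute('href');  # pQuery::DOM accessor
        printf "%2d) %s => %s\n", $i + 1, pQuery($_)->text, $href // '';
    });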




Answer 3:


If the download step is your main concern, the HTTrack website copier/downloader offers far more mirroring features (filters, updating an existing mirror, rewriting links for offline browsing) than the Perl libraries discussed here.
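For instance, a typical command-line invocation (URL, output directory, and filter are placeholders):

httrack "http://example.com/" -O ./mirror "+*.example.com/*" -v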




Answer 4:


To traverse an entire website and save it locally you could use wget -r -np http://localhost/manual/ (wget is available on Windows, either standalone or as part of Cygwin/MinGW). However, if you want to traverse and scrape data in one pass, Mojolicious can be used to build a simple parallel web crawler that is very light on dependencies:

#!/usr/bin/env perl
use feature qw(say);
use strict;
use utf8;
use warnings qw(all);

use Mojo::UserAgent;

# FIFO queue
my @urls = (Mojo::URL->new('http://localhost/manual/'));

# User agent following up to 5 redirects
my $ua = Mojo::UserAgent->new(max_redirects => 5);

# Track accessed URLs
my %uniq;

my $active = 0;
Mojo::IOLoop->recurring(
    0 => sub {

        # Keep up to 4 parallel crawlers sharing the same user agent
        for ($active .. 4 - 1) {

            # Dequeue or halt if there are no active crawlers anymore
            return ($active or Mojo::IOLoop->stop) unless my $url = shift @urls;

            # Fetch non-blocking just by adding a callback and marking as active
            ++$active;
            $ua->get(
                $url => sub {
                    my (undef, $tx) = @_;

                    # Skip failed fetches, but still free the crawler slot
                    return --$active unless $tx->res->is_success;

                    say "\n$url";

                    # Guard against pages without a <title> element
                    my $title = $tx->res->dom->at('html title');
                    say $title ? $title->text : '(no title)';

                    # Extract and enqueue URLs
                    for my $e ($tx->res->dom('a[href]')->each) {
                        # Validate href attribute
                        my $link = Mojo::URL->new($e->{href});
                        next if 'Mojo::URL' ne ref $link;

                        # "normalize" link
                        $link = $link->to_abs($tx->req->url)->fragment(undef);
                        next unless $link->protocol =~ /^https?$/x;

                        # Access every link once
                        next if ++$uniq{$link->to_string} > 1;

                        # Don't visit other hosts
                        next if $link->host ne $url->host;

                        push @urls, $link;
                        say " -> $link";
                    }

                    # Deactivate
                    --$active;
                }
            );
        }
    }
);

# Start event loop if necessary
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;
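Save the script as, say, crawler.pl and run it with perl crawler.pl; it halts on its own once the URL queue is empty and no fetches remain active.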


Source: https://stackoverflow.com/questions/2748185/how-do-i-parse-an-html-website-using-perl
