How do I parse an HTML website using Perl? [closed]

Submitted by 谁说我不能喝 on 2021-02-04 16:20:07

Question


Could you please give me some suggestions on how to parse HTML in Perl? I plan to parse the keywords (including URL links) and save them to a MySQL database. I am using Windows XP.

Also, do I first need to download the website's pages to the local hard drive with some offline download tool? If so, could you point me to a good one?


Answer 1:


You can use LWP to retrieve the pages you need to parse. There are several ways to parse the HTML itself: you can use regular expressions to find links and keywords (though that is generally poor practice for HTML), or proper parsing modules such as HTML::TokeParser or HTML::TreeBuilder.
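As a minimal sketch of the whole fetch-parse-store pipeline the question describes: the URL is a placeholder, and the DSN, credentials, and links table are hypothetical, so adjust them to your own MySQL schema.

#!/usr/bin/env perl
use strict;
use warnings;

use DBI;
use HTML::TreeBuilder;
use LWP::UserAgent;

# Fetch the page (example.com is a placeholder URL)
my $ua  = LWP::UserAgent->new;
my $res = $ua->get('http://example.com/');
die 'Fetch failed: ', $res->status_line unless $res->is_success;

# Build a parse tree from the downloaded HTML
my $tree = HTML::TreeBuilder->new_from_content($res->decoded_content);

# Hypothetical database and table; adjust DSN, credentials and schema
my $dbh = DBI->connect('DBI:mysql:database=scrape;host=localhost',
    'user', 'password', { RaiseError => 1 });
my $sth = $dbh->prepare('INSERT INTO links (text, href) VALUES (?, ?)');

# Walk every <a> element and store its text and target
for my $a ($tree->look_down(_tag => 'a')) {
    my $href = $a->attr('href') or next;
    $sth->execute($a->as_text, $href);
}

$tree->delete;    # HTML::TreeBuilder trees should be freed explicitly
$dbh->disconnect;

HTML::TokeParser works at a lower level, handing you a stream of tags rather than a tree, which can be simpler and faster for plain keyword extraction.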




Answer 2:


You can use one of many HTML parser modules. If you're familiar with jQuery, the pQuery module would be a good choice, as it ports most of the easy-to-use features of jQuery to Perl for HTML parsing and scraping.
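For example, a small sketch that lists every link on a page (the URL is a placeholder; inside each's callback, pQuery sets $_ to the current pQuery::DOM node):

#!/usr/bin/env perl
use strict;
use warnings;

use pQuery;

# Fetch and parse in one step, then select all anchors jQuery-style
pQuery('http://example.com/')
    ->find('a')
    ->each(sub {
        my $i = shift;                        # zero-based index
        my $href = $_->getAttribute('href');  # pQuery::DOM accessor
        printf "%2d) %s => %s\n", $i + 1, pQuery($_)->text, $href // '';
    });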




Answer 3:


If the download step is your main concern, the HTTrack website copier/downloader offers far more mirroring features (filters, updating an existing mirror, rewriting links for offline browsing) than the Perl libraries discussed here.
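For instance, a typical command-line invocation (URL, output directory, and filter are placeholders):

httrack "http://example.com/" -O ./mirror "+*.example.com/*" -v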




Answer 4:


To traverse an entire website and save it locally you could use wget -r -np http://localhost/manual/ (wget is available on Windows, either standalone or as part of Cygwin/MinGW). However, if you want to traverse and scrape data in one pass, Mojolicious can be used to build a simple parallel web crawler that is very light on dependencies:

#!/usr/bin/env perl
use feature qw(say);
use strict;
use utf8;
use warnings qw(all);

use Mojo::UserAgent;

# FIFO queue
my @urls = (Mojo::URL->new('http://localhost/manual/'));

# User agent following up to 5 redirects
my $ua = Mojo::UserAgent->new(max_redirects => 5);

# Track accessed URLs
my %uniq;

my $active = 0;
Mojo::IOLoop->recurring(
    0 => sub {

        # Keep up to 4 parallel crawlers sharing the same user agent
        for ($active .. 4 - 1) {

            # Dequeue or halt if there are no active crawlers anymore
            return ($active or Mojo::IOLoop->stop) unless my $url = shift @urls;

            # Fetch non-blocking just by adding a callback and marking as active
            ++$active;
            $ua->get(
                $url => sub {
                    my (undef, $tx) = @_;

                    # Skip failed fetches, but still free the crawler slot
                    return --$active unless $tx->res->is_success;

                    say "\n$url";

                    # Guard against pages without a <title> element
                    my $title = $tx->res->dom->at('html title');
                    say $title ? $title->text : '(no title)';

                    # Extract and enqueue URLs
                    for my $e ($tx->res->dom('a[href]')->each) {
                        # Validate href attribute
                        my $link = Mojo::URL->new($e->{href});
                        next if 'Mojo::URL' ne ref $link;

                        # "normalize" link
                        $link = $link->to_abs($tx->req->url)->fragment(undef);
                        next unless $link->protocol =~ /^https?$/x;

                        # Access every link once
                        next if ++$uniq{$link->to_string} > 1;

                        # Don't visit other hosts
                        next if $link->host ne $url->host;

                        push @urls, $link;
                        say " -> $link";
                    }

                    # Deactivate
                    --$active;
                }
            );
        }
    }
);

# Start event loop if necessary
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;
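Save the script as, say, crawler.pl and run it with perl crawler.pl; it halts on its own once the URL queue is empty and no fetches remain active.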


Source: https://stackoverflow.com/questions/2748185/how-do-i-parse-an-html-website-using-perl
