Parsing HTML with Mojolicious User Agent

Question

I have HTML something like this:

 <h1>My heading</h1>

 <p class="class1">
 <strong>SOMETHING</strong> INTERESTING (maybe not).
 </p>

 <div class="mydiv">
 <p class="class2">
 <a href="http://www.link.com">interesting link</a> </p>
 </div>

 <h2>Some other heading</h2>

The content between h1 and h2 varies. I know I can use CSS selectors in Mojo::DOM to select, say, the content of h1, h2, or p tags, but how do I select everything between h1 and h2? Or, more generally, everything between any two given tags?

Answer 1:

It's pretty straightforward. You can select all the interesting elements into a Mojo::Collection object (this is what Mojo::DOM's children method does, for example) and run a state-machine-like match while iterating over that collection.

Probably the most magic way to do this is to use Perl's range operator .. in scalar context:

In scalar context, ".." returns a boolean value. The operator is bistable, like a flip-flop, and emulates the line-range (comma) operator of sed, awk, and various editors. Each ".." operator maintains its own boolean state, even across calls to a subroutine that contains it. It is false as long as its left operand is false. Once the left operand is true, the range operator stays true until the right operand is true, AFTER which the range operator becomes false again. It doesn't become false till the next time the range operator is evaluated.
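To see the flip-flop behaviour on its own, here is a minimal sketch in core Perl, without Mojolicious. The list of tag names is made-up sample data standing in for DOM elements; only the `..` operator inside grep is the point.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample data: plain tag names standing in for DOM elements.
my @tags = qw(span h1 strong span h2 span);

# grep evaluates its block in scalar context, so '..' acts as a flip-flop:
# it turns on at 'h1' and turns off after 'h2', so both ends are included.
my @selected = grep { $_ eq 'h1' .. $_ eq 'h2' } @tags;

print "@selected\n";    # prints "h1 strong span h2"
```

Neither operand is a constant here, which matters: a constant operand of a scalar-context `..` is compared against the current input line number `$.` instead of being used as a plain boolean.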

Here's a simple example:

#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# slurp all DATA lines
my $dom = Mojo::DOM->new(do { local $/; <DATA> });

# select all children of <div id="yay"> into a Mojo::Collection
my $yay = $dom->at('#yay')->children;

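# select interesting ('..' operator in scalar context: flip-flop)
# note: newer Mojolicious versions call the element name "tag";
# older versions used "type" for this
my $interesting = $yay->grep(sub { my $e = shift;
    $e->tag eq 'h1' .. $e->tag eq 'h2';
});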

say $interesting->join("\n");

__DATA__
<div id="yay">
    <span>This isn't interesting</span>
    <h1>INTERESTING STARTS HERE</h1>
    <strong>SOMETHING INTERESTING</strong>
    <span>INTERESTING TOO</span>
    <h2>END OF INTERESTING</h2>
    <span>This isn't interesting</span>
</div>

Output

<h1>INTERESTING STARTS HERE</h1>
<strong>SOMETHING INTERESTING</strong>
<span>INTERESTING TOO</span>
<h2>END OF INTERESTING</h2>

Explanation

I'm using Mojo::Collection's grep to filter the collection object $yay. Since grep tests for truth, it evaluates the callback's return value in scalar context, so the .. operator acts as a flip-flop. It becomes true when it first sees an h1 element and becomes false again after it first sees an h2 element, so you get all elements between the two headings, including the headings themselves.
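If the headings themselves should be left out, the same idea can be written as an explicit state machine. The following is only a sketch under some assumptions: the markup is a hypothetical snippet modelled on the question, with h1 and h2 as top-level siblings, and it assumes a Mojolicious version where the element name is returned by tag.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup modelled on the question; h1 and h2 are siblings.
my $html = <<'HTML';
<h1>My heading</h1>
<p class="class1"><strong>SOMETHING</strong> INTERESTING (maybe not).</p>
<div class="mydiv">
  <p class="class2"><a href="http://www.link.com">interesting link</a></p>
</div>
<h2>Some other heading</h2>
<p>Not wanted</p>
HTML

my $dom = Mojo::DOM->new($html);

my $inside = 0;    # state: are we between the two headings?
my @between;       # elements collected while $inside is true

for my $e ($dom->children->each) {
    my $tag = $e->tag;    # assumes a Mojolicious version that provides tag
    if    ($tag eq 'h1') { $inside = 1; next }    # start after the h1
    elsif ($tag eq 'h2') { $inside = 0; next }    # stop at the h2
    push @between, $e if $inside;
}

# Print the collected elements as HTML fragments.
print "$_\n" for @between;
```

Here @between ends up holding the p and div elements only. The explicit flag is more verbose than the flip-flop, but it makes it easy to change what happens at each boundary.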

Since I think you know some Perl, and you can use arbitrary tests with .., I hope this helps you solve your problem!

Source: https://stackoverflow.com/questions/13809845/parsing-html-with-mojolicious-user-agent

Tags

perl

mojolicious