How do I most reliably preserve HTML Entities when processing HTML documents with Mojo::DOM?

旧巷老猫 submitted on 2020-08-24 07:26:44

Question


I'm using Mojo::DOM to identify and print out phrases (meaning strings of text between selected HTML tags) in hundreds of HTML documents that I'm extracting from existing content in the Movable Type content management system.

I'm writing those phrases out to a file, so they can be translated into other languages as follows:

    $dom = Mojo::DOM->new(Mojo::Util::decode('UTF-8', $page->text));

    ##########
    #
    # Break down the Body into phrases. This is done by listing the tags and tag combinations that
    # surround each block of text that we're looking to capture.
    #
    ##########

    print FILE "\n\t### Body\n\n";

    for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->map('text')->each ) {
        print_phrase($phrase); # utility function to write out the phrase to a file
    }

When Mojo::DOM encounters embedded HTML entities (such as &trade; and &nbsp;), it converts them into the characters they represent rather than passing them along as written. I want the entities to be passed through as written.

I recognized that I could use Mojo::Util::decode to pass these HTML entities through to the file I'm writing. The problem is "You can only call decode 'UTF-8' on a string that contains valid UTF-8. If it doesn't, for example because it is already converted to Perl characters, it will return undef."

If this is the case, I have to either try to figure out how to test the encoding of the current HTML page before calling Mojo::Util::decode('UTF-8', $page->text), or I must use some other technique to preserve the encoded HTML entities.
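
One way to avoid testing the encoding up front is to attempt a strict decode and fall back to the original string when it fails. The sketch below is an illustration, not part of the original post: decode_if_needed is a hypothetical helper, and it uses the core Encode module in place of Mojo::Util::decode so that it runs on its own. It assumes that a string which fails strict UTF-8 decoding has already been converted to Perl characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode UTF-8 bytes when possible, otherwise
# return the input unchanged (assumed to be Perl characters already).
sub decode_if_needed {
    my ($text) = @_;
    return '' unless defined $text;

    # Work on a copy, because Encode::decode may modify its argument
    # when a CHECK value such as FB_CROAK is supplied.
    my $copy = $text;
    my $decoded = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };

    # eval yields undef if the strict decode died on invalid input
    return defined $decoded ? $decoded : $text;
}

# Five UTF-8 bytes become four characters after decoding
print length(decode_if_needed("caf\xC3\xA9")), "\n";
```

With Mojo::Util loaded, the same idea can be written as Mojo::Util::decode('UTF-8', $text) // $text, since decode returns undef on failure. Note that this only settles the decoding question; it does not by itself stop the parser from replacing entities.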

How do I most reliably preserve encoded HTML Entities when processing HTML documents with Mojo::DOM?


Answer 1:


It looks like mapping to text replaces XML entities with their characters, whereas working with the nodes and using their content preserves the entities. This minimal example:

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p>this &amp; &quot;that&quot;</p>');
for my $phrase ($dom->find('p')->each) {
    print $phrase->content(), "\n";
}

prints:

this &amp; &quot;that&quot;

If you want to keep your loop and map, replace map('text') with map('content') like this:

for my $phrase ($dom->find('p')->map('content')->each) {

If you have nested tags and want to find only the texts (but not print those nested tag names, only their contents), you'll need to scan the DOM tree:

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p><i>this &amp; <b>&quot;</b><b>that</b><b>&quot;</b></i></p><p>done</p>');

for my $node (@{$dom->find('p')->to_array}) {
    print_content($node);
}

sub print_content {
    my ($node) = @_;
    if ($node->type eq "text") {
        print $node->content(), "\n";
    }
    if ($node->type eq "tag") {    
        for my $child ($node->child_nodes->each) {
            print_content($child);
        }
    }
}

which prints:

this & 
"
that
"
done



Answer 2:


Through testing, my colleagues and I were able to determine that Mojo::DOM->new() was decoding ampersand characters (&) automatically, rendering the preservation of HTML Entities as written impossible. To get around this, we added the following subroutine to double encode ampersand:

sub encode_amp {
    my ($text) = @_;

    ##########
    #
    # We discovered that we need to encode ampersand
    # characters being passed into Mojo::DOM->new() to avoid HTML entities being decoded
    # automatically by Mojo::Util::html_unescape().
    #
    # What we're doing is calling $dom = Mojo::DOM->new(encode_amp($string)) which double encodes
    # any incoming ampersand or &amp; characters.
    #
    ##########

    $text .= '';           # Suppress uninitialized value warnings
    $text =~ s!&!&amp;!g;  # HTML encode ampersand characters
    return $text;
}
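
To see why the double encoding helps, the following sketch parses the same fragment with and without encode_amp(). It is an illustration with made-up sample markup, not code from our script. The parser unescapes entities once, so the extra &amp; layer is what gets consumed and the entities underneath come out as written.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Avoid "Wide character" warnings when printing decoded characters
binmode(STDOUT, ':encoding(UTF-8)');

# Same substitution as encode_amp() above, repeated so the sketch runs on its own
sub encode_amp {
    my ($text) = @_;
    $text .= '';           # Turn undef into an empty string
    $text =~ s!&!&amp;!g;  # HTML encode ampersand characters
    return $text;
}

# Hypothetical sample markup
my $html = '<p>Acme&trade; &amp; Co.</p>';

# Parsed directly: the entities are replaced by the characters they stand for
my $decoded = Mojo::DOM->new($html)->at('p')->text;

# Parsed after double encoding: only the added &amp; layer is unescaped
my $preserved = Mojo::DOM->new(encode_amp($html))->at('p')->text;

print "$decoded\n";
print "$preserved\n";
```

The first line printed should contain the literal trademark character and a bare ampersand, while the second should show &trade; and &amp; exactly as they appear in the source.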

Later in the script we pass $page->text through encode_amp() as we instantiate a new Mojo::DOM object.

    $dom = Mojo::DOM->new(encode_amp($page->text));

##########
#
# Break down the Body into phrases. This is done by listing the tags and tag combinations that
# surround each block of text that we're looking to capture.
#
# Note that "h2 b" is an important tag combination for capturing major headings on pages
# in this theme. The tags "span" and "a" are also.
#
# We added caption and th to support tables.
#
# We added li and li a to support ol (ordered lists) and ul (unordered lists).
#
# We got the complicated map('descendant_nodes') logic from @Grinnz on StackOverflow, see:
# https://stackoverflow.com/questions/55130871/how-do-i-most-reliably-preserve-html-entities-when-processing-html-documents-wit#comment97006305_55131737
#
#
# Original set of selectors in $dom->find() below is as follows:
#   'h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a'
#
##########

    print FILE "\n\t### Body\n\n";        

    for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->
        map('descendant_nodes')->map('each')->grep(sub { $_->type eq 'text' })->map('content')->uniq->each ) {           

        print_phrase($phrase);

    }

The code block above incorporates previous suggestions from @Grinnz as seen in the comments in this question. Thanks also to @Robert for his answer, which had a good observation about how Mojo::DOM works.

This code definitely works for my application.
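
The descendant_nodes chain in this answer is easier to follow on a small input. The sketch below uses made-up sample markup and a shortened selector list, both assumptions for illustration, and collects the phrases into an array instead of calling print_phrase().

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup with nested tags
my $dom = Mojo::DOM->new(
    '<h2><b>Title</b></h2><p>Hello <strong>big</strong> world</p>'
);

my @phrases = $dom->find('h2, h2 b, p, p strong')
    ->map('descendant_nodes')              # one collection of nodes per matched element
    ->map('each')                          # flatten those collections into a single list
    ->grep(sub { $_->type eq 'text' })     # keep only the text nodes
    ->map('content')                       # raw text of each text node
    ->uniq                                 # nested selectors match the same text more than once
    ->each;

print "[$_]\n" for @phrases;
```

Because "h2 b" and "p strong" are also descendants of "h2" and "p", their text nodes are reached twice; uniq is what removes those repeats. It also removes identical phrases that occur in different places on the page, which is fine when building a translation list but worth knowing about.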



Source: https://stackoverflow.com/questions/55130871/how-do-i-most-reliably-preserve-html-entities-when-processing-html-documents-wit
