I have a file formatted like this one:
Eye color
Eye color, color
blue, cornflower blue, steely blue
Parse the file using a module and iterate over the elements you need (<p>
of class ul1
). Extract those comma-separated phrases from each and wrap links around them; then replace the element with that new content. Write the changed tree out in the end.
Using HTML::TreeBuilder (with its workhorse HTML::Element)
use warnings;
use strict;
use feature 'say';
use HTML::Entities;
use HTML::TreeBuilder;
my $file = shift // die "Usage: $0 file\n";
my $tree = HTML::TreeBuilder->new_from_file($file);
foreach my $elem ($tree->look_down(_tag => "p", class => "ul1")) {
my @new_content;
for ($elem->content_list) {
my @w = split /\s*,\s*/;
my $wrapped = join ", ",
map { qq(<a href="entry://$_">).$_.q(</a>) } @w;
push @new_content, $wrapped;
}
$elem->delete_content;
$elem->push_content( @new_content );
};
say decode_entities $tree->as_HTML;
In your case an element ($elem
) will have one item in the content_list
so you don't have to collect modified content into an array (@new_content
) but can process that one piece only, what simplifies the code. Working with a list as above doesn't hurt of course.
I redirect the output of this program to an .html
file. The generated file is qouite frugal on newlines. If pretty HTML matters make a pass with a tool like HTML::Tidy or HTML::PrettyPrinter.
In a one-liner? Nah, it's too much. And please don't use regex as there's trouble down the road; it needs close work to get it right, is easy to end up buggy, is sensitive to smallest details, and brittle for even slightest changes in input. And that's when it can do the job. There are reasons for libraries.
Another good tool for this job is Mojo::DOM. For example
use Mojo::DOM;
use Path::Tiny; # only to read the file into a string easily
my $html = path($file)->slurp;
my $dom = Mojo::DOM->new($html);
foreach my $elem ($dom->find('p.ul1')->each) {
my @w = split /,/, $elem->text;
my $new = join ', ',
map { qq(<a href="entry://$_">).$_.q(</a>) } @w;
$elem->replace( $new );
}
say $dom;
Produces the same HTML as above (just nicer, and note no need to deal with entities).
Newer module versions provide new_tag method with which the additional link above is made as
my $new = join ', ',
map { $e->new_tag('a', 'href' => "entry://$_", $_) } @w;
what takes care of some subtle needs (HTML escaping for one). The main docs don't say when this method was added, see changelog (May 2018, so supposedly in v5.28; it works with my 5.29.2).
I padded the shown sample to this file for testing:
<!DOCTYPE html> <title>Eye color</title> <body>
<p class="ul">Eye color, color</p>
<p class="ul1">blue, cornflower blue, steely blue</p>
<p class="ul1">velvet brown</p> <link rel="stylesheet" href="a.css"></>
weasel
<p class="ul">weasel</p>
<p class="ul1">musteline</p> <link rel="stylesheet" href="a.css"></>
</body> </html>
Update It's been clarified that the given markup snippet isn't merely a fragment of a presumably full HTML document but that it is a file (as stated) that stands as shown, as a custom format using HTML; apart from the required changes the rest of it need be preserved.
A particularly unpleasant detail proves to be the </>
part; each of HTML::TreeBuilder
, Mojo::DOM
, and XML::LibXML
† discards it while parsing. I couldn't find a way to make them keep that piece.
It was Marpa::HTML that processed the whole fragment as required, changing what was asked while leaving alone the rest of it.
use warnings;
use strict;
use feature 'say';
use Path::Tiny;
use Marpa::HTML qw(html);
my $file = shift // die "Usage: $0 file\n";
my $html = path($file)->slurp;
my $marpa = Marpa::HTML::html(
\$html,
{
'p.ul1' => sub {
return join ', ',
map { qq(<a href="entry://$_">).$_.q(</a>) }
split /\s*,\s*/, Marpa::HTML::contents();
},
}
);
say $$marpa;
The processing of the <p>
tags of class ul1
is the same as before: split the content on comma and wrap each piece into an <a>
tag, then join them back with ,
This prints (with added line-breaks and indentation for readability)
Eye color
<p class="ul">Eye color, color</p>
<a href="entry://blue">blue</a>,
<a href="entry://cornflower blue">cornflower blue</a>,
<a href="entry://steely blue">steely blue</a>
<a href="entry://velvet brown">velvet brown</a>
<link rel="stylesheet" href="a.css">
</>
weasel
<p class="ul">weasel</p> <a href="entry://musteline">musteline</a>
<link rel="stylesheet" href="a.css">
</>
It is the overall approach of this module that is suited for a task like this
Marpa::HTML
is an extremely liberal HTML parser.Marpa::HTML
does not reject any documents, no mater how poorly they fit the HTML standards.
Here it processed a custom piece of HTML-like markup, leaving things like </>
in place.
†
See this post for an example of very permissive processing of HTML with XML::LibXML
perl -0777 -MWeb::Query=wq -lne'
my $w = wq $_; my $sep = ", ";
$w->filter("p.ul1")->each(sub {
my (undef, $e) = @_;
$e->html(join $sep, map {
qq(<a href="entry://$_">$_</a>)
} split $sep, $e->text);
});
print $w->as_html;
'
One-liner:
cat text | perl -pE 's{<p class="ul1">\K.*?(?=<\/p>)}{ join ", ", map {qq|<a href="entry://$_">$_</a>|} split /, */, $& }eg'