Perl: add <a></a> around words within an HTML/XML tag


I have a file formatted like this one:

Eye color
<p class="ul">Eye color, color</p>
<p class="ul1">blue, cornflower blue, steely blue</p>


                      
3 answers

Answer by 灰色年华, 2020-12-21 20:41

Parse the file using a module and iterate over the elements you need (<p> of class ul1). Extract the comma-separated phrases from each and wrap links around them; then replace the element with that new content. Write the changed tree out at the end.

Using HTML::TreeBuilder (with its workhorse HTML::Element)

use warnings;
use strict;
use feature 'say';

use HTML::Entities;
use HTML::TreeBuilder;

my $file = shift // die "Usage: $0 file\n";

my $tree = HTML::TreeBuilder->new_from_file($file);

foreach my $elem ($tree->look_down(_tag => "p", class => "ul1")) {
    my @new_content;
    for ($elem->content_list) {
        # Each content item is expected to be a plain text node here
        my @w = split /\s*,\s*/;
        my $wrapped = join ", ",
            map { qq(<a href="entry://$_">).$_.q(</a>) } @w;
        push @new_content, $wrapped;
    }
    $elem->delete_content;
    $elem->push_content( @new_content );
}

# The links were pushed as text, so as_HTML escapes them;
# decode the entities to get the tags back
say decode_entities $tree->as_HTML;


In your case an element ($elem) will have one item in the content_list, so you don't have to collect the modified content into an array (@new_content) but can process that one piece only, which simplifies the code. Working with a list as above doesn't hurt, of course.

I redirect the output of this program to an .html file. The generated file is quite frugal with newlines. If pretty HTML matters, make a pass with a tool like HTML::Tidy or HTML::PrettyPrinter.

In a one-liner? Nah, it's too much. And please don't use a regex for this, as there's trouble down the road: it takes close work to get right, easily ends up buggy, is sensitive to the smallest details, and is brittle under even slight changes in the input. And that's when it can do the job at all. There are reasons for libraries.
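To make the brittleness concrete, here is a minimal sketch in core Perl only. The function name naive_wrap and the three inputs are made up for illustration; the pattern hard-codes one exact spelling of the opening tag, as a quick regex solution typically would.

```perl
use strict;
use warnings;

# Naive regex rewrite, for illustration only (hypothetical helper):
# it matches just one exact spelling of the opening tag.
sub naive_wrap {
    my ($html) = @_;
    $html =~ s{(<p class="ul1">)(.*?)(</p>)}{
        my ($open, $text, $close) = ($1, $2, $3);   # copy captures before further matching
        $open
          . join(', ', map { qq(<a href="entry://$_">$_</a>) } split /\s*,\s*/, $text)
          . $close
    }ge;
    return $html;
}

my @inputs = (
    q(<p class="ul1">blue, steely blue</p>),          # rewritten
    q(<p class='ul1'>blue, steely blue</p>),          # single quotes: left unchanged
    q(<p id="x" class="ul1">blue, steely blue</p>),   # extra attribute: left unchanged
);

print naive_wrap($_), "\n" for @inputs;
```

Only the first input is rewritten; the other two pass through untouched and without any warning, which is the kind of silent failure a parser-based approach avoids.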

Another good tool for this job is Mojo::DOM. For example

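use warnings;
use strict;
use feature 'say';

use Mojo::DOM;
use Path::Tiny;  # only to read the file into a string easily

my $file = shift // die "Usage: $0 file\n";
my $html = path($file)->slurp;

my $dom = Mojo::DOM->new($html);

foreach my $elem ($dom->find('p.ul1')->each) {
    # Split on commas along with surrounding whitespace, so that
    # no phrase (and no href) starts with a stray space
    my @w = split /\s*,\s*/, $elem->text;
    my $new = join ', ',
        map { qq(<a href="entry://$_">).$_.q(</a>) } @w;
    $elem->replace( $new );
}

say $dom;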


Produces the same HTML as above (just nicer, and note no need to deal with entities).

Newer versions of the module provide a new_tag method, with which the link above can be built as

my $new = join ', ',
   map { $elem->new_tag('a', 'href' => "entry://$_", $_) } @w;


which takes care of some subtle needs (HTML escaping for one). The main docs don't say when this method was added; see the changelog (May 2018, so supposedly in v5.28; it works with my 5.29.2).

I padded the shown sample to this file for testing:

<!DOCTYPE html>  <title>Eye color</title> <body>
<p class="ul">Eye color, color</p> 
<p class="ul1">blue, cornflower blue, steely blue</p> 
<p class="ul1">velvet brown</p> <link rel="stylesheet" href="a.css"></>
weasel
<p class="ul">weasel</p> 
<p class="ul1">musteline</p> <link rel="stylesheet" href="a.css"></>
</body> </html>




Update: It's been clarified that the given markup snippet isn't merely a fragment of a presumably full HTML document but that it is a file (as stated) that stands as shown, a custom format using HTML; apart from the required changes, the rest of it needs to be preserved.

A particularly unpleasant detail proves to be the </> part; each of HTML::TreeBuilder, Mojo::DOM, and XML::LibXML^† discards it while parsing. I couldn't find a way to make them keep that piece.

It was Marpa::HTML that processed the whole fragment as required, changing what was asked while leaving alone the rest of it.

use warnings;
use strict;
use feature 'say';
use Path::Tiny;

use Marpa::HTML qw(html);

my $file = shift // die "Usage: $0 file\n";
my $html = path($file)->slurp;

my $marpa = Marpa::HTML::html( 
    \$html,
    {
        'p.ul1' => sub {
            return join ', ', 
                map { qq(<a href="entry://$_">).$_.q(</a>) } 
                split /\s*,\s*/, Marpa::HTML::contents();
        },
    }
);  

say $$marpa; 


The processing of the <p> tags of class ul1 is the same as before: split the content on commas and wrap each piece in an <a> tag, then join them back with a comma and a space.
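The split, wrap, and join step can be isolated into a small function and tried on its own, without any parser. This is a sketch in core Perl; the name wrap_links is hypothetical, and the trimming and the skipping of empty pieces are additions that the code in this answer does not have.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer's code): turn "a, b, c"
# into a comma-separated list of <a> links, one per phrase.
sub wrap_links {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;    # trim both ends so no phrase carries stray whitespace
    return join ', ',
        map  { qq(<a href="entry://$_">$_</a>) }
        grep { length }         # skip empty pieces, e.g. from "a,,b"
        split /\s*,\s*/, $text;
}

print wrap_links('blue, cornflower blue, steely blue'), "\n";
```

A function like this could be called from the handler of whichever parser is used, so that the text processing stays the same across the variants shown.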

This prints (with added line-breaks and indentation for readability)

Eye color
<p class="ul">Eye color, color</p> 
<a href="entry://blue">blue</a>, 
    <a href="entry://cornflower blue">cornflower blue</a>, 
    <a href="entry://steely blue">steely blue</a> 
    <a href="entry://velvet brown">velvet brown</a> 
<link rel="stylesheet" href="a.css">
</>
weasel
<p class="ul">weasel</p> <a href="entry://musteline">musteline</a> 
<link rel="stylesheet" href="a.css">
</>


It is the overall approach of this module that makes it suited for a task like this:


  Marpa::HTML is an extremely liberal HTML parser. Marpa::HTML does not reject any documents, no matter how poorly they fit the HTML standards.


Here it processed a custom piece of HTML-like markup, leaving things like </> in place.



^† See this post for an example of very permissive processing of HTML with XML::LibXML.
                                                                        
                                                        
            
            
              
                
Answer by 被撕碎了的回忆, 2020-12-21 20:48

perl -0777 -MWeb::Query=wq -lne'
    my $w = wq $_; my $sep = ", ";
    $w->filter("p.ul1")->each(sub {
        my (undef, $e) = @_;
        $e->html(join $sep, map {
            qq(<a href="entry://$_">$_</a>)
        } split $sep, $e->text);
    });
    print $w->as_html;
'

                                                                        
                                                        
            
            
              
                
Answer by 南方客, 2020-12-21 20:54

One-liner:

cat text | perl -pE 's{<p class="ul1">\K.*?(?=<\/p>)}{ join ", ", map {qq|<a href="entry://$_">$_</a>|} split /, */, $& }eg'

                                                                        
                                                        
            
            
              
                