Regex to match all HTML tags except <p> and </p>

抹茶落季 2020-11-30 06:31

I need to match and remove all tags using a regular expression in Perl. I have the following:

<\\\\??(?!p).+?>

But this still matches…

13 answers
  • 2020-11-30 07:22

    If you insist on using a regex, something like this will work in most cases:

    # Remove all HTML except "p" tags
    $html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;
    

    Explanation:

    s{
      <             # opening angled bracket
      (?>/?)        # ratchet past optional / 
      (?:
        [^pP]       # non-p tag
        |           # ...or...
        [pP][^\s>/] # longer tag that begins with p (e.g., <pre>)
      )
      [^>]*         # everything until closing angled bracket
      >             # closing angled bracket
     }{}gx; # replace with nothing, globally
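
    To see why the atomic group matters and where the pattern breaks down, here is a minimal sketch that wraps the substitution in a helper and runs it on two sample strings. The helper name strip_non_p_tags and the sample inputs are assumptions made for illustration; they are not part of the original answer.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper (name is an assumption): applies the substitution
    # from the answer and returns the filtered string.
    sub strip_non_p_tags {
        my ($html) = @_;
        $html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;
        return $html;
    }

    # Typical case: <p ...> and </p> survive, every other tag is removed.
    # Without the atomic group, /? could backtrack to empty and [^pP] would
    # then consume the "/" of "</p>", so the closing tag would be removed too.
    print strip_non_p_tags('<div><p class="x">Hi <b>there</b></p><pre>code</pre></div>'), "\n";
    # prints: <p class="x">Hi there</p>code

    # Known limitation: a ">" inside an attribute value ends the match early.
    print strip_non_p_tags('<a title="x>y">link</a>'), "\n";
    # prints: y">link
    ```

    The second call shows the kind of input behind the "most cases" caveat: the pattern has no notion of quoted attribute values, so it stops at the first ">" it sees.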
    

    But really, save yourself some headaches and use a parser instead. CPAN has several modules that are suitable. Here's an example using the HTML::TokeParser module that comes with the extremely capable HTML::Parser CPAN distribution:

    use strict;
    use warnings;
    
    use HTML::TokeParser;
    
    my $parser = HTML::TokeParser->new('/some/file.html')
      or die "Could not open /some/file.html - $!";
    
    while(my $t = $parser->get_token)
    {
      # Skip start or end tags that are not "p" tags
      next  if(($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p');
    
      # Print everything else normally (see HTML::TokeParser docs for explanation)
      if($t->[0] eq 'T')
      {
        print $t->[1];
      }
      else
      {
        print $t->[-1];
      }
    }
    

    HTML::TokeParser accepts input in the form of a file name, an open file handle, or a string (passed as a reference to a scalar). Wrapping the above code in a library and making the destination configurable (i.e., not just printing as in the above) is not hard. The result will be much more reliable and maintainable, and possibly also faster (HTML::Parser uses a C-based backend), than trying to use regular expressions.
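
    As a rough sketch of that idea, the loop can be moved into a subroutine that takes the input source and a callback for the output. The names filter_non_p_tags and $emit are assumptions made for illustration, and the example assumes the HTML::Parser distribution is installed from CPAN.

    ```perl
    use strict;
    use warnings;
    use HTML::TokeParser;

    # Hypothetical wrapper (name and interface are assumptions).
    # $source: a file name, an open file handle, or a reference to a string.
    # $emit:   a code reference called with each piece of output text.
    sub filter_non_p_tags {
        my ($source, $emit) = @_;

        my $parser = HTML::TokeParser->new($source)
          or die "Could not open input - $!";

        while (my $t = $parser->get_token) {
            # Skip start or end tags that are not "p" tags
            next if ($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p';

            # Text tokens carry their text in slot 1; other tokens carry
            # the original source text in the last slot.
            $emit->($t->[0] eq 'T' ? $t->[1] : $t->[-1]);
        }
        return;
    }

    # Usage: parse a string (passed by reference) and collect the output.
    my $html = '<div><p class="x">Hi <b>there</b></p></div>';
    my $out  = '';
    filter_non_p_tags(\$html, sub { $out .= $_[0] });
    print "$out\n";
    ```

    Passing a callback keeps the destination open: the same subroutine can append to a string, write to a file handle, or feed another filter.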
