and
I need to match and remove all tags using a regular expression in Perl. I have the following:
<\\\\??(?!p).+?>
But this still matche
If you insist on using a regex, something like this will work in most cases:
# Remove all HTML except "p" tags
$html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;
Explanation:
s{
< # opening angled bracket
(?>/?) # ratchet past optional /
(?:
[^pP] # non-p tag
| # ...or...
[pP][^\s>/] # longer tag that begins with p (e.g., <pre>)
)
[^>]* # everything until closing angled bracket
> # closing angled bracket
}{}gx; # replace with nothing, globally
But really, save yourself some headaches and use a parser instead. CPAN has several modules that are suitable. Here's an example using the HTML::TokeParser module that comes with the extremely capable HTML::Parser CPAN distribution:
use strict;
use HTML::TokeParser;
my $parser = HTML::TokeParser->new('/some/file.html')
or die "Could not open /some/file.html - $!";
while(my $t = $parser->get_token)
{
# Skip start or end tags that are not "p" tags
next if(($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p');
# Print everything else normally (see HTML::TokeParser docs for explanation)
if($t->[0] eq 'T')
{
print $t->[1];
}
else
{
print $t->[-1];
}
}
HTML::Parser accepts input in the form of a file name, an open file handle, or a string. Wrapping the above code in a library and making the destination configurable (i.e., not just print
ing as in the above) is not hard. The result will be much more reliable, maintainable, and possibly also faster (HTML::Parser uses a C-based backend) than trying to use regular expressions.