I'm trying to remove unused spans (i.e. those with no attribute) from HTML files, having already cleaned up all the attributes I didn't want with other regular expressions.
With all your help I've published a script that does everything I need.
http://github.com/timabell/decrufter/
Try HTML::Parser:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
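# One flag per currently open <span>: true if its tags should be printed.
my @print_span;
my $p = HTML::Parser->new(
    start_h => [ sub {
        my ($text, $name, $attr) = @_;
        if ( $name eq 'span' ) {
            # Keep the tag only if it carries at least one attribute.
            my $print_tag = %$attr ? 1 : 0;
            push @print_span, $print_tag;
            return if !$print_tag;
        }
        print $text;
    }, 'text,tagname,attr'],
    end_h => [ sub {
        my ($text, $name) = @_;
        if ( $name eq 'span' ) {
            # Drop the end-tag if its matching start-tag was dropped.
            return if !pop @print_span;
        }
        print $text;
    }, 'text,tagname'],
    # All other events (text, comments, and so on) pass through unchanged.
    default_h => [ sub { print shift }, 'text'],
);
$p->parse_file(\*DATA) or die "Err: $!";
$p->eof;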
__END__
<html>
<head>
<title>This is a title</title>
</head>
<body>
<h1>This is a header</h1>
a <span>b <span style="color:red;">c</span> d</span>e
</body>
</html>
Don't use regexps for processing HTML or XML. You never know what input you'll get. Consider this line, whose title attribute value is perfectly legal HTML:
a <span>b <span style="color:red;" title="being closed with </span>">c</span> de
Would you have thought of that?
Use an HTML or XML parser instead.
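To make the failure concrete, here is a minimal sketch of what a naive substitution does to that line. The helper name naive_strip and its pattern are assumptions made up for illustration; they are not the asker's actual regex.

```perl
use strict;
use warnings;

# Hypothetical naive approach: drop an attribute-less <span> together with
# the first </span> that follows it, keeping the text in between.
sub naive_strip {
    my ($html) = @_;
    $html =~ s{<span>(.*?)</span>}{$1}gs;
    return $html;
}

my $input = 'a <span>b <span style="color:red;" title="being closed with </span>">c</span> de';
print naive_strip($input), "\n";
```

The lazy match stops at the </span> that sits inside the title attribute, so the attribute value is cut short and the real end-tag is left pointing at the wrong start-tag.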
Regex is insufficiently powerful to parse HTML (or XML). Any regex you can come up with will fail to match various formulations of even valid HTML (let alone real-world tag soup).
This is a nesting problem. Regex can't normally handle nesting at all, but Perl has a non-standard extension to support regex recursion: (?n), where n is the group number to recurse into. So something like this would match both spans in your example:
(<span\b[^>]*>(?:[^<]++|<(?!\/?span\b)|(?1))*+<\/span>)
See perlfaq 6.11.
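As a rough sketch of the recursion idea, the following spells a balanced-span pattern out with /x and tries it on the example line. The variable $balanced_span and the helper first_balanced_span are hypothetical names, and the pattern is one possible formulation (assuming Perl 5.10 or later), not a robust HTML matcher.

```perl
use strict;
use warnings;

# Group 1 matches one balanced <span>...</span>, recursing into itself via
# (?1) whenever a nested <span> start-tag appears. Requires Perl 5.10+.
my $balanced_span = qr{
    (
        <span\b[^>]*>            # start-tag, with or without attributes
        (?:
            [^<]++               # plain text up to the next tag
          | <(?!/?span\b)        # a "<" that does not begin a span tag
          | (?1)                 # a nested, balanced span
        )*+
        </span>                  # the matching end-tag
    )
}x;

# Return the first balanced span found in the input, or undef if none.
sub first_balanced_span {
    my ($html) = @_;
    return $html =~ $balanced_span ? $1 : undef;
}

print first_balanced_span('a <span>b <span style="color:red;">c</span> d</span>e'), "\n";
```

Note that the start-tag part still relies on [^>]*, so a ">" inside a quoted attribute value will confuse it.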
Unfortunately this still isn't enough, because it needs to count both attributed and unattributed <span> start-tags, allowing the </span> end-tag to close either one. I can't think of a way to do this without also matching the attributed span start-tags.
You need an HTML parser for this, and you should be using one anyway because regex for HTML/XML is decidedly the Wrong Thing.
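For comparison, here is a minimal sketch of the parser-based approach wrapped in a function that returns a string. It assumes the CPAN module HTML::Parser is installed; the function name strip_bare_spans is hypothetical, and the stack-of-flags technique is the one from the HTML::Parser answer in this thread.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, not part of core Perl

# Remove <span> tags that have no attributes, keeping their contents.
sub strip_bare_spans {
    my ($html) = @_;
    my $out = '';
    my @keep;    # one flag per open <span>: true if its tags are kept
    my $p = HTML::Parser->new(
        start_h => [ sub {
            my ($text, $name, $attr) = @_;
            if ($name eq 'span') {
                my $has_attr = %$attr ? 1 : 0;
                push @keep, $has_attr;
                return unless $has_attr;
            }
            $out .= $text;
        }, 'text,tagname,attr' ],
        end_h => [ sub {
            my ($text, $name) = @_;
            # Drop the end-tag when its start-tag was dropped.
            return if $name eq 'span' && !pop @keep;
            $out .= $text;
        }, 'text,tagname' ],
        default_h => [ sub { $out .= shift }, 'text' ],
    );
    $p->parse($html);
    $p->eof;
    return $out;
}

print strip_bare_spans('a <span>b <span style="color:red;">c</span> d</span>e'), "\n";
```

Because the parser tokenizes quoted attribute values itself, a </span> appearing inside an attribute is treated as ordinary attribute text and not as an end-tag.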