How can I remove unused, nested HTML span tags with a Perl regex?


I'm trying to remove unused spans (i.e. those with no attributes) from HTML files, having already cleaned up all the attributes I didn't want with other regular expressions.


4 Answers

灰色年华

2021-01-06 10:54

With all your help I've published a script that does everything I need.

http://github.com/timabell/decrufter/


不知归路

2021-01-06 11:01

Try HTML::Parser:

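```
#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser;

# Stack with one flag per open span: true if its start tag was printed.
my @print_span;
my $p = HTML::Parser->new(
  start_h   => [ sub {
    my ($text, $name, $attr) = @_;
    if ( $name eq 'span' ) {
      # A hash in scalar context is true only if it has at least one key,
      # so spans without attributes are flagged as "do not print".
      my $print_tag = %$attr;
      push @print_span, $print_tag;
      return if !$print_tag;
    }
    print $text;
  }, 'text,tagname,attr'],
  end_h => [ sub {
    my ($text, $name) = @_;
    if ( $name eq 'span' ) {
      # Drop the end tag if the matching start tag was dropped.
      return if !pop @print_span;
    }
    print $text;
  }, 'text,tagname'],
  # Everything else (text, comments, declarations) is passed through unchanged.
  default_h => [ sub { print shift }, 'text'],
);
$p->parse_file(\*DATA) or die "Err: $!";
$p->eof;

__END__
<html>
<head>
<title>This is a title</title>
</head>
<body>
<h1>This is a header</h1>
a <span>b <span style="color:red;">c</span> d</span>e
</body>
</html>
```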


滥情空心

2021-01-06 11:01
Don't use regexps for processing (HTML ==) XML. You never know what input you'll get. Consider this, valid HTML:
```
a <span>b <span title=">">c</span> d</span>e
```
Would you have thought of that?
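
To make the concern concrete, here is a minimal sketch in Perl. The input string, its `title` attribute and the `naive_start_tags` helper are all made up for illustration; the pattern is a deliberately naive one of the kind this answer warns against, not a recommended approach.

```perl
use strict;
use warnings;

# Hypothetical input: the inner span has an attribute whose value contains '>'.
my $html = 'a <span>b <span title="x > y">c</span> d</span>e';

# Naive attempt: treat everything from '<span' up to the first '>' as a start tag.
sub naive_start_tags {
    my ($text) = @_;
    return $text =~ /(<span[^>]*>)/g;
}

# The second "tag" printed here is cut off inside the attribute value,
# because the pattern stops at the '>' that belongs to the title text.
print "$_\n" for naive_start_tags($html);
```

A real parser tracks the quoting state of attribute values, which is exactly what a character class such as `[^>]*` cannot do.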

Use an XML processor instead.

Also see the Related Questions (to the right) for your question.
情话喂你

2021-01-06 11:07
Regex is insufficiently powerful to parse HTML (or XML). Any regex you can come up with will fail to match various formulations of even valid HTML (let alone real-world tag soup).

This is a nesting problem. Regex can't normally handle nesting at all, but Perl has a non-standard extension to support regex recursion: (?n), where n is the group number to recurse into. So something like this would match both spans in your example:
```
(<span\b[^>]*>(?:(?!<\/?span\b).|(?1))*<\/span>)
```
See perlfaq 6.11.
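
As a rough illustration of how `(?1)` recursion behaves, the sketch below finds balanced span elements in a string. The pattern is one possible formulation written for this example, and `outer_spans` is a hypothetical helper name. It assumes Perl 5.10 or later, and it still uses the naive `[^>]*` for start tags, so it inherits every weakness discussed in this thread.

```perl
use strict;
use warnings;

# Group 1 matches one complete span element; (?1) re-enters group 1
# whenever a nested span start tag is encountered.
my $span_re = qr{
    (                          # group 1: a balanced span element
        <span\b[^>]*>          # start tag (naive about quoted attribute values)
        (?:
            (?!</?span\b) .    # any character that does not begin a span tag
          |
            (?1)               # or a whole nested span element
        )*
        </span>
    )
}xs;

# Returns every outermost balanced span element found in the input.
sub outer_spans {
    my ($html) = @_;
    my @found;
    while ($html =~ /$span_re/g) {
        push @found, $1;
    }
    return @found;
}

my $sample = 'a <span>b <span style="color:red;">c</span> d</span>e';
print "$_\n" for outer_spans($sample);
```

This only locates balanced elements. It does not decide which end tags to delete, which is the part discussed next.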

Unfortunately this still isn't enough, because it needs to be able to count both attributed and unattributed start-tags, allowing the end-tag to close either one. I can't think of a way this can be done without also matching the attributed span start-tags.
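
The counting described here amounts to a stack: each span start-tag pushes a flag recording whether it was kept, and each span end-tag pops a flag to learn whether it should be kept too. A minimal sketch of just that bookkeeping, assuming the markup has already been split into tokens by a real parser. The token layout (`type`, `name`, `attrs`, `text`) and the `strip_bare_spans` name are invented for this illustration.

```perl
use strict;
use warnings;

# Tokens are hash refs: { type => 'start'|'end'|'text', name, attrs, text }.
# This layout is hypothetical; a real parser would supply its own events.
sub strip_bare_spans {
    my (@tokens) = @_;
    my @keep_stack;    # one flag per open span: 1 = kept, 0 = dropped
    my $out = '';
    for my $tok (@tokens) {
        if ($tok->{type} eq 'start' && $tok->{name} eq 'span') {
            my $attrs = $tok->{attrs} || {};
            my $keep  = %$attrs ? 1 : 0;    # keep only spans with attributes
            push @keep_stack, $keep;
            next unless $keep;
        }
        elsif ($tok->{type} eq 'end' && $tok->{name} eq 'span') {
            # The end tag shares the fate of the start tag it closes;
            # a stray end tag (empty stack) is dropped as well.
            next unless pop @keep_stack;
        }
        $out .= $tok->{text};
    }
    return $out;
}

# The example from the question, as a hand-written token list.
my @tokens = (
    { type => 'text',  text => 'a ' },
    { type => 'start', name => 'span', attrs => {}, text => '<span>' },
    { type => 'text',  text => 'b ' },
    { type => 'start', name => 'span', attrs => { style => 'color:red;' },
      text => '<span style="color:red;">' },
    { type => 'text',  text => 'c' },
    { type => 'end',   name => 'span', text => '</span>' },
    { type => 'text',  text => ' d' },
    { type => 'end',   name => 'span', text => '</span>' },
    { type => 'text',  text => 'e' },
);
print strip_bare_spans(@tokens), "\n";
```

A plain regex substitution has nowhere to keep this stack, which is why the tokenizing is better left to a parser and the decision logic to ordinary code.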

You need an HTML parser for this, and you should be using one anyway because regex for HTML/XML is decidedly the Wrong Thing.