How can I remove unused, nested HTML span tags with a Perl regex?

南旧 2021-01-06 10:23

I'm trying to remove unused spans (i.e. those with no attributes) from HTML files, having already cleaned up all the attributes I didn't want with other regular expressions.

4 Answers
  • 2021-01-06 10:54

    With all your help I've published a script that does everything I need.

    http://github.com/timabell/decrufter/

  • 2021-01-06 11:01

    Try HTML::Parser:

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    use HTML::Parser;
    
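    # Stack of flags, one per open <span>: true if the span had attributes
    # (and was printed), false if it was dropped.
    my @print_span;
    my $p = HTML::Parser->new(
      start_h   => [ sub {
        my ($text, $name, $attr) = @_;
        if ( $name eq 'span' ) {
          my $print_tag = keys %$attr;   # attribute count; 0 means unused
          push @print_span, $print_tag;
          return if !$print_tag;         # drop an attribute-less start tag
        }
        print $text;
      }, 'text,tagname,attr'],
      end_h => [ sub {
        my ($text, $name) = @_;
        if ( $name eq 'span' ) {
          return if !pop @print_span;    # drop the end tag of a dropped span
        }
        print $text;
      }, 'text,tagname'],
      # Everything else (text, comments, declarations) passes through unchanged.
      default_h => [ sub { print shift }, 'text'],
    );
    $p->parse_file(\*DATA) or die "Err: $!";
    $p->eof;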
    
    __END__
    <html>
    <head>
    <title>This is a title</title>
    </head>
    <body>
    <h1>This is a header</h1>
    a <span>b <span style="color:red;">c</span> d</span>e
    </body>
    </html>
    
  • 2021-01-06 11:01

    Don't use regexps for processing (HTML ==) XML. You never know what input you'll get. Consider this, valid HTML:

    a <span>b <span style="color:red;" title="being closed with </span>">c</span> de
    

    Would you have thought of that?

    Use an XML processor instead.

    Also see the Related Questions (to the right) for your question.
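
To make the failure concrete, here is a minimal sketch of what a naive substitution does to that input. The pattern below is a hypothetical first attempt written for illustration, not something taken from the question:

```perl
use strict;
use warnings;

# The input from the answer above: one attribute value contains "</span>".
my $html = 'a <span>b <span style="color:red;" title="being closed with </span>">c</span> de';

# Hypothetical naive attempt: strip attribute-less spans, keep their content.
(my $naive = $html) =~ s{<span>(.*?)</span>}{$1}gs;

print "$naive\n";
```

The lazy `.*?` stops at the first `</span>` it sees, which is the one inside the `title` attribute, so the attribute value gets truncated and the real end tag is left in place.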

  • 2021-01-06 11:07

    Regex is insufficiently powerful to parse HTML (or XML). Any regex you can come up with will fail to match various formulations of even valid HTML (let alone real-world tag soup).

    This is a nesting problem. Regex can't normally handle nesting at all, but Perl has a non-standard extension to support regex recursion: (?n), where n is the group number to recurse into. So something like this would match both spans in your example:

    (<span[^>]*>(?:[^<]++|(?1)|<(?!\/?span\b))*+<\/span>)
    

    See perlfaq 6.11.
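
As a rough sketch of the recursion idea, the script below spells out one possible recursive pattern with the `/x` flag and runs it on the example from the question. The exact pattern is an assumption for illustration (it requires Perl 5.10 or later), and it only matches balanced spans; it does not decide which ones to remove:

```perl
use strict;
use warnings;

# Group 1 matches one complete span element and recurses into itself
# through (?1) whenever it meets a nested span.
my $span_re = qr{
    (                       # group 1: a complete span element
      <span\b[^>]*>         # start tag, with or without attributes
      (?:
          [^<]++            # text that contains no tag opener
        | (?1)              # a nested span element (recursion)
        | <(?!/?span\b)     # a tag opener that does not begin a span tag
      )*+
      </span>               # the matching end tag
    )
}x;

my $html = 'a <span>b <span style="color:red;">c</span> d</span>e';
my ($outer) = $html =~ $span_re;

print "$outer\n";
```

The match covers the outer span together with the nested one, which shows that the nesting is counted correctly even though the pattern still cannot tell attributed and unattributed spans apart when it comes to removal.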

    Unfortunately this still isn't enough, because it needs to be able to count both attributed and unattributed <span> start-tags, allowing the </span> end-tag to close either one. I can't think of a way this can be done without also matching the attributed span start-tags.

    You need an HTML parser for this, and you should be using one anyway because regex for HTML/XML is decidedly the Wrong Thing.
