How can I remove unused, nested HTML span tags with a Perl regex?

南旧 2021-01-06 10:23

I'm trying to remove unused spans (i.e. those with no attribute) from HTML files, having already cleaned up all the attributes I didn't want with other regular expressions.

4 answers
  •  情话喂你
    2021-01-06 11:07

    Regex is insufficiently powerful to parse HTML (or XML). Any regex you can come up with will fail to match various formulations of even valid HTML (let alone real-world tag soup).
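
    To make the failure concrete, here is a minimal sketch of a naive attempt. It is written in Python rather than Perl purely for illustration (the pattern behaves the same way in both), and the sample markup is made up:

```python
import re

# Made-up sample: a bare span wrapping an attributed one.
html = '<span>a<span class="x">b</span>c</span>'

# Naive attempt: unwrap every attribute-less <span>...</span> pair.
# The lazy .*? stops at the FIRST </span>, which belongs to the inner span.
naive = re.sub(r"<span>(.*?)</span>", r"\1", html)

print(naive)  # a<span class="x">bc</span>
```

    The output is still well-formed, but the wrong end-tag was removed: "c" has silently moved inside the class="x" span.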

    This is a nesting problem. Regex can't normally handle nesting at all, but Perl has a non-standard extension to support regex recursion: (?n), where n is the group number to recurse into. So something like this would match both spans in your example:

    (<span[^>]*>.*+(?1)?.*+<\/span>)
    

    See perlfaq 6.11.

    Unfortunately this still isn't enough, because it needs to be able to count both attributed and unattributed start-tags, allowing the end-tag to close either one. I can't think of a way this can be done without also matching the attributed span start-tags.

    You need an HTML parser for this, and you should be using one anyway because regex for HTML/XML is decidedly the Wrong Thing.
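
    As a rough sketch of what the parser route can look like, the example below uses Python's standard-library html.parser instead of a Perl module; that choice is an assumption made for illustration, and the names SpanStripper and strip_bare_spans are hypothetical. The idea is the counting described above: keep a stack with one entry per open span, so each end-tag is paired with the start-tag it actually closes.

```python
from html.parser import HTMLParser


class SpanStripper(HTMLParser):
    """Drop <span> tags that carry no attributes, keeping their content."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; pass through as-is
        super().__init__(convert_charrefs=False)
        self.out = []         # pieces of the rewritten document
        self.span_stack = []  # one bool per open <span>: True = keep its tags

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            keep = bool(attrs)          # attributed spans are kept
            self.span_stack.append(keep)
            if not keep:
                return                  # swallow the bare start-tag
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "span" and self.span_stack:
            if not self.span_stack.pop():
                return                  # swallow the matching end-tag
        self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_bare_spans(html):
    parser = SpanStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<p><span>a <span class="x">b</span> <span>c</span></span></p>'
print(strip_bare_spans(sample))  # <p>a <span class="x">b</span> c</p>
```

    This is not a full serializer: end-tags are re-emitted in lowercase and stray end-tags are passed through untouched, so treat it as a starting point. In Perl, the same stack idea should carry over to an event-based parser such as HTML::Parser.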
