How can I remove unused, nested HTML span tags with a Perl regex?

南旧 2021-01-06 10:23

I'm trying to remove unused spans (i.e. those with no attributes) from HTML files, having already cleaned up all the attributes I didn't want with other regular expressions.

4 answers

情话喂你 (original poster)

2021-01-06 11:07
Regex is insufficiently powerful to parse HTML (or XML). Any regex you can come up with will fail to match various formulations of even valid HTML (let alone real-world tag soup).

This is a nesting problem. Regex can't normally handle nesting at all, but Perl has a non-standard extension to support regex recursion: (?n), where n is the group number to recurse into. So something like this would match both spans in your example:
```
(<span[^>]*>.*+(?1)?.*+<\/span>)
```
See perlfaq 6.11.

Unfortunately this still isn't enough, because it needs to be able to count both attributed and unattributed start-tags, allowing the end-tag to close either one. I can't think of a way this can be done without also matching the attributed span start-tags.

You need an HTML parser for this, and you should be using one anyway because regex for HTML/XML is decidedly the Wrong Thing.
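
To make the counting argument concrete, here is a minimal sketch of the parser approach. It is written in Python with the standard-library `html.parser` purely for illustration; the same stack logic should carry over to a Perl parser such as HTML::Parser. The names `SpanStripper` and `strip_bare_spans` are hypothetical, and the sketch assumes reasonably well-formed input (processing instructions are not re-emitted, and end tags come back lowercased).

```python
from html.parser import HTMLParser


class SpanStripper(HTMLParser):
    """Re-emit HTML, dropping <span> tags that carry no attributes."""

    def __init__(self):
        # Keep entities as written instead of decoding them into text.
        super().__init__(convert_charrefs=False)
        self.out = []   # pieces of the rewritten document
        self.keep = []  # one flag per open <span>: True if it had attributes

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.keep.append(bool(attrs))
            if not attrs:
                return  # bare span: drop the start tag
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> pass through untouched.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The stack tells us which start tag this </span> belongs to.
        if tag == "span" and self.keep and not self.keep.pop():
            return  # closes a bare span: drop the end tag too
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_bare_spans(html):
    """Return html with attribute-less <span>...</span> wrappers removed."""
    parser = SpanStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(strip_bare_spans('<p><span>a <span class="x">b</span> c</span></p>'))
# prints: <p>a <span class="x">b</span> c</p>
```

The stack is what the regex lacks: every `<span>` start tag, attributed or not, pushes a flag, and each `</span>` pops the flag of the tag it actually closes, so the right end tags are dropped regardless of nesting order.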