Final solution for using regex to remove html nested tags of the same type?

I've spent days trying to find a solution WITH regex (before somebody says it: I know I should be using the PHP DOM Document library or something alike, but let's take th


1 Answer

故里飘歌

2021-01-20 20:36
DISCLAIMER

Since the question has been received positively, I will post an answer that explains what is wrong with your approach and shows how to match text that is not some specific text.

HOWEVER, I want to emphasize: Do not use this to parse real, arbitrary HTML code, as regex should only be used on plain text.

What is wrong with your regex

Your regex contains the <div((.+?(?=<\/div>)|.+?(?=<div>))|(?R))* part (the same as <div((.+?(?=<\/?div>))|(?R))*) before the closing <\/div> part. When you have some delimited text, do not rely on plain lazy/greedy dot matching (unless it is used inside an unrolled-loop structure, and you know what you are doing). What it does is this:
- <div - matches <div literally (and also inside <diverse, because there is no word boundary or \s after it)
- ( - Group 1 that matches:
  - (.+?(?=<\/div>)|.+?(?=<div>)) - matches either any 1+ chars (as few as possible) up to the first </div> or to the first <div>
  - |
  - (?R) - recurses the whole pattern (i.e. inserts and uses the entire regex at this point)
- )* - repeat Group 1 zero or more times.
The problem is clear: the (.+?(?=<\/?div>)) part does not exclude matching <div> or </div> itself, whereas this branch MUST only match text that is NOT EQUAL to the leading and trailing delimiters.
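
The difference between the two branches can be checked in isolation. Below is a minimal sketch using Python's standard re module, which is an assumption made purely for illustration (the question itself is about PHP/PCRE); the sample string is made up.

```python
import re

# Lazy dot plus lookahead, as in the first branch of the question's regex.
LAZY_DOT = re.compile(r".+?(?=</?div>)")
# Tempered greedy token: every character is checked before it is consumed.
TEMPERED = re.compile(r"(?:(?!</?div\b).)+")

sample = "<div>x</div>"  # hypothetical input

# The lazy dot must consume at least one character, and the lookahead is
# only checked after the consumed text, so the opening delimiter itself
# gets swallowed.
lazy_match = LAZY_DOT.match(sample)
print(lazy_match.group())  # <div>x

# The tempered token refuses to start on a delimiter at all.
tempered_match = TEMPERED.match(sample)
print(tempered_match)  # None

# On ordinary text it stops right before the next delimiter.
inner_text = TEMPERED.match("x</div>").group()
print(inner_text)  # x
```

The first result shows the lazy branch matching <div> as if it were plain content, which is exactly what the recursion alternative was supposed to handle.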

Solution(s)

To match text other than some specific text, use a tempered greedy token.
```
<div\b[^<]*>((?:(?!<\/?div\b).)+|(?R))*<\/div>\s*
             ^^^^^^^^^^^^^^^^^^^ 
```
See the regex demo. Note that you must use the DOTALL modifier (the s flag in PHP) so that . can match across newlines. The capturing group is redundant; you can remove it.

What is important here is that (?:(?!<\/?div\b).)+ only matches one or more characters, none of which is the starting character of a <div...> or </div sequence. See the thread linked above for how that works.
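
As a rough way to experiment with the token and the DOTALL requirement, here is a sketch in Python's standard re module. Note the assumptions: Python's re does not support (?R), so this is NOT the recursive pattern above. It swaps in a different technique, repeatedly removing the innermost div until nothing matches. The helper name strip_divs and the sample markup are hypothetical.

```python
import re

# Non-recursive variant of the pattern above: it matches only a <div> whose
# content contains no other <div ...> or </div (an "innermost" div).
INNERMOST_DIV = re.compile(r"<div\b[^<]*>(?:(?!</?div\b).)*</div>\s*", re.DOTALL)
# Same pattern without DOTALL, to show why the modifier matters.
NO_DOTALL = re.compile(r"<div\b[^<]*>(?:(?!</?div\b).)*</div>\s*")


def strip_divs(html: str) -> str:
    """Remove div elements from the inside out until none can be matched."""
    while True:
        html, count = INNERMOST_DIV.subn("", html)
        if count == 0:
            return html


sample = '<p>keep</p><div class="a">one\n<div>two</div>\nthree</div><p>end</p>'
print(strip_divs(sample))  # <p>keep</p><p>end</p>

# Without DOTALL the dot cannot cross the newline, so nothing matches.
print(NO_DOTALL.search("<div>a\nb</div>"))  # None
```

With PCRE, the (?R) alternative does in a single pass what the loop does here in several.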

As for performance, tempered greedy tokens are resource-consuming. The unroll-the-loop technique comes to the rescue:
```
<div\b[^<]*>(?:[^<]+(?:<(?!\/?div\b)[^<]*)*|(?R))*<\/div>\s*
```
See this regex demo

Now, the token looks like [^<]+(?:<(?!\/?div\b)[^<]*)*: 1+ characters other than <, followed by 0+ sequences of a < that is not followed by /div or div (as a whole word) and then 0+ characters other than <.
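
To see that the unrolled token covers the same text as the tempered one, the two can be compared side by side. This is a small sketch in Python's standard re module (chosen only for illustration; the tokens themselves contain nothing PCRE-specific), and the sample content is made up.

```python
import re

TEMPERED = re.compile(r"(?:(?!</?div\b).)+", re.DOTALL)
# Negated character classes match newlines, so no DOTALL is needed here.
UNROLLED = re.compile(r"[^<]+(?:<(?!/?div\b)[^<]*)*")

text = "one <b>two</b>\nthree</div> tail"  # hypothetical div content

tempered_part = TEMPERED.match(text).group()
unrolled_part = UNROLLED.match(text).group()
print(unrolled_part == tempered_part)  # True
print(repr(unrolled_part))  # 'one <b>two</b>\nthree'

# The unrolled token starts with [^<]+, so unlike the tempered token it
# cannot begin on a tag.
print(UNROLLED.match("<b>two</b>"))          # None
print(TEMPERED.match("<b>two</b>").group())  # <b>two</b>
```

The last two lines suggest one edge case worth testing before relying on the unrolled pattern: content that begins directly with a non-div tag (for example <div><b>...) is not consumed by this branch.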

<div\b might still match in <div-tmp, so perhaps <div(?:\s|>) is a better way to deal with this via regex. Still, parsing HTML with a DOM parser is much easier.
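
The three ways of matching the opening tag name can be compared quickly. The sketch below uses Python's standard re module for illustration only; the dictionary keys and the tag list are made-up names.

```python
import re

OPENERS = {
    "bare": re.compile(r"<div"),
    "word_boundary": re.compile(r"<div\b"),
    "space_or_close": re.compile(r"<div(?:\s|>)"),
}

tags = ["<div>", '<div class="a">', "<diverse>", "<div-tmp>"]  # made-up tags

# For each opener, collect the tags it accepts at the start of the string.
results = {
    name: [tag for tag in tags if pattern.match(tag)]
    for name, pattern in OPENERS.items()
}
for name, matched in results.items():
    print(name, matched)
```

Only the last opener rejects both <diverse> and <div-tmp>: the bare prefix accepts everything, and \b still sees a boundary between v and the hyphen.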