I've spent days trying to find a solution WITH regex (before somebody says it: I know I should be using the PHP DOM Document library or something similar, but let's take th
Since the question has been met with a positive reaction, I will post an answer explaining what is wrong with your approach and showing how to match text that is not some specific text.
HOWEVER, I want to emphasize: Do not use this to parse real, arbitrary HTML code, as regex should only be used on plain text.
Your regex contains the <div((.+?(?=<\/div>)|.+?(?=<div>))|(?R))* part (same as <div((.+?(?=<\/?div>))|(?R))*) before matching the closing <\/div> part. When you have some delimited text, do not rely on plain lazy/greedy dot matching (unless it is used in an unroll-the-loop structure, when you know what you are doing). What it does is this:
<div - matches <div literally (also in <diverse, due to a missing word boundary or a \s after it)
( - Group 1 that matches:
  (.+?(?=<\/div>)|.+?(?=<div>)) - matches either any 1+ chars (as few as possible) up to the first </div> or up to the first <div>
  | - or
  (?R) - recurses (i.e. inserts and uses the whole pattern at this position)
)* - repeats Group 1 zero or more times.
The problem is clear: the (.+?(?=<\/?div>)) part does not exclude matching <div> or </div>, whereas this branch MUST only match text NOT EQUAL to the leading and trailing delimiters.
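To make the problem concrete, here is a minimal sketch in Python (picked only for illustration; its re module has no (?R), so just the dot branch is exercised, and the sample string is made up). re.DOTALL plays the role of the DOTALL modifier. Anchored at an opening delimiter, the branch simply consumes it:

```python
import re

# The lazy-dot branch from the question, taken on its own.
# re.DOTALL lets "." match newlines as well.
LAZY_DOT = re.compile(r".+?(?=</?div>)", re.DOTALL)

# Hypothetical sample input, not taken from the question.
sample = "<div>inner</div>"

# ".+?" must take at least one character, and nothing prevents that
# character from being the "<" that opens a delimiter.
lazy_match = LAZY_DOT.match(sample)
print(repr(lazy_match.group(0)))  # '<div>inner'
```

The opening <div> ends up inside the text matched by the branch that was supposed to cover only non-delimiter text.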
To match text other than some specific text, use a tempered greedy token.
<div\b[^<]*>((?:(?!<\/?div\b).)+|(?R))*<\/div>\s*
             ^^^^^^^^^^^^^^^^^^^
See the regex demo. Note that you must use the DOTALL modifier (s in PHP) so that the dot can match text across newlines. The capturing group is redundant, so you can remove it.
What is important here is that (?:(?!<\/?div\b).)+ only matches 1 or more characters that are not the starting character of a <div....> or </div sequence. See my above linked thread on how that works.
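A small sketch of the tempered greedy token on its own, in Python for illustration (Python's re has no (?R), so the recursive part of the pattern is left out; the sample strings are invented). The token stops right before a div tag and refuses to start on one:

```python
import re

# Tempered greedy token: take one character at a time, but only at positions
# where "<div" or "</div" (as a whole word) does not start.
TEMPERED = re.compile(r"(?:(?!</?div\b).)+", re.DOTALL)

# Hypothetical sample inputs.
text_then_div = "intro <b>bold</b>\nmore<div class='x'>inner</div>"
starts_with_div = "<div>inner</div>"

# Other tags and newlines are consumed; the match ends just before "<div".
tempered_match = TEMPERED.match(text_then_div)
print(repr(tempered_match.group(0)))  # 'intro <b>bold</b>\nmore'

# At a position where a delimiter starts, the token cannot match at all.
print(TEMPERED.match(starts_with_div))  # None
```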
As for performance, tempered greedy tokens are resource-consuming. The unroll-the-loop technique comes to the rescue:
<div\b[^<]*>(?:[^<]+(?:<(?!\/?div\b)[^<]*)*|(?R))*<\/div>\s*
See this regex demo
Now, the token looks like [^<]+(?:<(?!\/?div\b)[^<]*)*: 1+ characters other than <, followed with 0+ sequences of a < that is not followed with /div or div (as a whole word) and then 0+ characters other than <.
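The two tokens can be compared side by side. This is only a sketch in Python (no (?R) there, so just the tokens are tested, on made-up input), but it shows where they agree and one case where, as far as I can tell, they differ:

```python
import re

TEMPERED = re.compile(r"(?:(?!</?div\b).)+", re.DOTALL)
# Unrolled form: a run of non-"<" characters, then any number of
# ("<" not starting a div tag, followed by more non-"<" characters).
UNROLLED = re.compile(r"[^<]+(?:<(?!/?div\b)[^<]*)*")

# Hypothetical sample inputs.
sample = "intro <b>bold</b>\nmore<div class='x'>inner</div>"
tag_first = "<b>bold</b><div>inner</div>"

# On text that starts with a non-"<" character both tokens match the same span.
same_result = TEMPERED.match(sample).group(0) == UNROLLED.match(sample).group(0)
print(same_result)  # True

# The unrolled form needs at least one non-"<" character first,
# so it does not match when the text begins with a (non-div) tag.
print(TEMPERED.match(tag_first).group(0))  # <b>bold</b>
print(UNROLLED.match(tag_first))           # None
```

So the unrolled variant is not a drop-in replacement in every case: text that begins directly with a non-div tag is covered by the tempered token but not by the unrolled one as written.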
<div\b might still match in <div-tmp, so perhaps <div(?:\s|>) is a better way to deal with this via regex. Still, parsing HTML with DOM is much easier.
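The three ways of opening the tag can be checked quickly. A minimal sketch in Python, with invented tag names as test input:

```python
import re

# Hypothetical opening tags: two real divs and two look-alikes.
candidates = ["<div>", "<div class='a'>", "<diverse>", "<div-tmp>"]

patterns = {
    "plain": re.compile(r"<div"),
    "word_boundary": re.compile(r"<div\b"),
    "space_or_gt": re.compile(r"<div(?:\s|>)"),
}

# For each pattern, record whether it matches at the start of each candidate.
results = {
    name: [bool(pattern.match(tag)) for tag in candidates]
    for name, pattern in patterns.items()
}

for name, flags in results.items():
    print(name, flags)
# plain [True, True, True, True]
# word_boundary [True, True, False, True]
# space_or_gt [True, True, False, False]
```

Only the last form rejects both <diverse and <div-tmp while still accepting a div with or without attributes.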