Final solution for using regex to remove html nested tags of the same type?

挽巷 2021-01-20 19:55

I've spent days trying to find a solution WITH regex (before somebody says it: I know I should be using the PHP DOM Document library or something alike, but let's take th

1 Answer
  • 2021-01-20 20:36

    DISCLAIMER

    Since the question has met with a positive reaction, I will post an answer explaining what is wrong with your approach and showing how to match text that is not some specific text.

    HOWEVER, I want to emphasize: Do not use this to parse real, arbitrary HTML code, as regex should only be used on plain text.

    What is wrong with your regex

    Your regex contains the part <div((.+?(?=<\/div>)|.+?(?=<div>))|(?R))* (equivalent to <div((.+?(?=<\/?div>))|(?R))*) before the closing <\/div>. When you match delimited text, do not rely on plain lazy or greedy dot matching (unless it is part of an unroll-the-loop structure and you know what you are doing). Here is what the pattern does:

    • <div - matches <div literally (and also the start of <diverse, because there is no word boundary or \s after it)
    • ( - Group 1 that matches:
      • (.+?(?=<\/div>)|.+?(?=<div>)) - matches either any 1+ chars (as few as possible) up to the first </div> or to the first <div>
      • |
      • (?R) - recurses the whole pattern (i.e. inserts the entire regex at this point and applies it)
    • )* - repeat Group 1 zero or more times.

    The problem is clear: the (.+?(?=<\/?div>)) part does not exclude matching <div> or </div>. This branch MUST only match text that is NOT the leading or trailing delimiter.
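
    To make the flaw concrete, here is a minimal sketch in Python. It checks only the non-recursive branch, because Python's built-in re module has no (?R). The sample string and variable names are made up for illustration. The second pattern puts a negative lookahead in front of every character, so it cannot start on a div delimiter.

```python
import re

# The lazy-dot branch from the question, in its merged form.
lazy_branch = re.compile(r".+?(?=</?div>)")
# A branch that checks, before every character, that no div tag starts there.
guarded_branch = re.compile(r"(?:(?!</?div\b).)+")

# Hypothetical sample input, not taken from the question.
sample = "<div>inner</div>"

# .+? must consume at least one character, so it swallows the opening <div>.
lazy_match = lazy_branch.match(sample)
print(lazy_match.group(0))   # <div>inner

# The guarded branch cannot start on "<div", so it does not match here at all.
guarded_match = guarded_branch.match(sample)
print(guarded_match)         # None
```

    Because the lazy branch is free to consume a nested <div>, the recursion alternative never gets the chance to handle it.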

    Solution(s)

    To match text other than some specific text use a tempered greedy token.

    <div\b[^<]*>((?:(?!<\/?div\b).)+|(?R))*<\/div>\s*
                 ^^^^^^^^^^^^^^^^^^^ 
    

    See the regex demo. Note that you must use the DOTALL modifier so that . can match across newlines. The capturing group is redundant, so you can remove it.

    What is important here is that (?:(?!<\/?div\b).)+ matches 1 or more characters, none of which is the starting character of a <div...> or </div sequence. See my above linked thread on how that works.
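
    As an illustration, the token can be tried on its own in Python (a sketch only: re.DOTALL stands in for the DOTALL modifier, and the sample text is invented). It shows where the token stops and what changes without DOTALL.

```python
import re

TOKEN = r"(?:(?!</?div\b).)+"

# With DOTALL, "." also matches newlines; without it, the token stops at "\n".
token_dotall = re.compile(TOKEN, re.DOTALL)
token_plain = re.compile(TOKEN)

# Hypothetical sample: two lines of content followed by a closing div.
text = "line one\n<b>bold</b> line two</div> tail"

with_dotall = token_dotall.match(text).group(0)
without_dotall = token_plain.match(text).group(0)

print(repr(with_dotall))     # 'line one\n<b>bold</b> line two'
print(repr(without_dotall))  # 'line one'
```

    Other tags such as <b> are consumed, since the lookahead only rejects positions where <div or </div begins.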

    As for performance, tempered greedy tokens are resource-consuming. The unroll-the-loop technique comes to the rescue:

    <div\b[^<]*>(?:[^<]+(?:<(?!\/?div\b)[^<]*)*|(?R))*<\/div>\s*
    

    See this regex demo

    Now, the token looks like [^<]+(?:<(?!\/?div\b)[^<]*)*: 1+ characters other than <, followed by 0+ sequences of a < that is not followed by /div or div (as a whole word) and then 0+ characters other than <.
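
    The following Python sketch compares the two tokens on a few invented inputs (again without the recursive part, which Python's re module does not support). It is a sanity check under those assumptions, not a benchmark.

```python
import re

tempered = re.compile(r"(?:(?!</?div\b).)+", re.DOTALL)
unrolled = re.compile(r"[^<]+(?:<(?!/?div\b)[^<]*)*")

# Hypothetical inputs that start with ordinary text.
samples = [
    "plain text</div>",
    "a <b>bold</b> word<div>",
    "x < y and more</div>",
]

# For each input, both tokens stop right before the div delimiter.
pairs = [(tempered.match(s).group(0), unrolled.match(s).group(0)) for s in samples]
for tempered_text, unrolled_text in pairs:
    print(repr(tempered_text), repr(unrolled_text))

# Content that begins with a non-div tag: the unrolled token needs a non-"<" first.
leading_tag = "<b>bold</b></div>"
tempered_leading = tempered.match(leading_tag)
unrolled_leading = unrolled.match(leading_tag)
print(tempered_leading.group(0))  # <b>bold</b>
print(unrolled_leading)           # None
```

    The last case shows that the two tokens are not strictly interchangeable: the unrolled form requires at least one character other than < before any tag, so content that starts with a non-div tag is matched by the tempered token but not by the unrolled one.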

    <div\b might still match in <div-tmp, so <div(?:\s|>) is perhaps a better way to deal with this via regex. Still, parsing HTML with a DOM parser is much easier.
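
    A short Python sketch of the three ways to anchor the opening tag, using made-up tag names, shows which ones are accepted by each prefix:

```python
import re

# Three candidate prefixes for the opening tag.
prefixes = {
    "bare": re.compile(r"<div"),
    "boundary": re.compile(r"<div\b"),
    "strict": re.compile(r"<div(?:\s|>)"),
}

# Hypothetical tags: two real divs and two look-alikes.
tags = ["<div>", '<div class="x">', "<diverse>", "<div-tmp>"]

# For each prefix, collect the tags it accepts at the start of the string.
results = {
    name: [tag for tag in tags if pattern.match(tag)]
    for name, pattern in prefixes.items()
}

for name, accepted in results.items():
    print(name, accepted)
# bare     accepts all four tags
# boundary rejects <diverse> but still accepts <div-tmp>
# strict   accepts only <div> and <div class="x">
```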
