I am writing a regular expression that should match only malformed tags like: *some text here, some other tags may be here as well but no ending 'p' tag*
All of the solutions offered so far match the second <P>, but that's wrong. What if there are two consecutive <P> elements without closing tags? The second one won't be matched because the first match ate its opening tag. You can avoid that problem by using a lookahead as I did here:
@"<p\b(?>(?:[^<]+|<(?!/?p>))*)(?=<p\b|$)"
As for the rest of it, I used a "not the initial or not the rest" technique along with an atomic group to guide the regex to a match as efficiently as possible (and, more importantly, to fail as quickly as possible if it's going to).
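To see the lookahead at work, here is a minimal Python sketch. It is an approximation rather than a port of the pattern above: the atomic group is replaced by an unrolled loop (Python's `re` only gained atomic groups in 3.11), and the name `find_unclosed` and the sample string are made up for illustration.

```python
import re

# Unrolled-loop approximation of the pattern above (assumption of this sketch:
# the original uses an atomic group, which older Python versions lack).
# The trailing lookahead checks for the next <p without consuming it.
UNCLOSED_P = re.compile(r"<p\b[^<]*(?:<(?!/?p>)[^<]*)*(?=<p\b|$)", re.IGNORECASE)

def find_unclosed(markup):
    """Return each <p> run that reaches another <p> or end of input with no </p>."""
    return [m.group(0) for m in UNCLOSED_P.finditer(markup)]

# Two consecutive unclosed paragraphs followed by a properly closed one.
sample = "<p>one<p>two<p>three</p>"
print(find_unclosed(sample))  # ['<p>one', '<p>two']
```

Because the next opening tag is only looked at, not consumed, the second unclosed paragraph is still available as the start of the next match.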
I know this isn't likely (or even legal HTML?) in this case, but a generic solution for unclosed XML tags would be pretty difficult, because you need to consider what happens with nested tags like
<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>
I'm pretty sure the regular expressions given so far would match the second <p> there, even though it is not actually an unclosed <p>.
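One way to picture why nesting is hard for a single pattern: deciding whether a tag is unclosed means pairing each closing tag with the nearest open one, which calls for a stack. The Python sketch below is only an illustration under that XML-style assumption (where <p> may nest); `unclosed_p_offsets` is a hypothetical helper, not something proposed in this thread.

```python
import re

# Matches opening or closing p tags; group 1 is "/" for a closing tag.
P_TAG = re.compile(r"<(/?)p\b[^>]*>", re.IGNORECASE)

def unclosed_p_offsets(markup):
    """Return start offsets of <p> tags with no matching </p>, assuming
    XML-style nesting (each </p> closes the nearest open <p>)."""
    open_offsets = []
    for tag in P_TAG.finditer(markup):
        if tag.group(1):            # closing tag: pair it with the latest opener
            if open_offsets:
                open_offsets.pop()
        else:                       # opening tag: remember where it started
            open_offsets.append(tag.start())
    return open_offsets

nested = "<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>"
broken = "<P>(of the western circuit)<P>PREFACE</P>"
print(unclosed_p_offsets(nested))  # []
print(unclosed_p_offsets(broken))  # [0]
```

Under this reading the nested example contains no unclosed tag at all, which is exactly the case a plain "next <p> before any </p>" pattern gets wrong.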
Match group one of:
(?:<p>(?:(?!<\/?p>).?)+)(<p>)
matches the second <p> in:
<P>(of the western circuit)<P>PREFACE</P>
Note: I'm usually one of those that say: "Don't do HTML with regex, use a parser instead". But I don't think the specific problem can be solved with a parser, which would probably just ignore/transparently deal with the invalid markup.
Regex is not always a good choice for XML/HTML data. In particular, attributes, case sensitivity, comments, etc. all have a big impact.
For XHTML, I'd use XmlDocument/XDocument and an XPath query.
For "non-X" HTML, I'd look at the HTML Agility Pack and query it with XPath in the same way.
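As a rough illustration of the parser route, here is a sketch that swaps in Python's standard `xml.etree.ElementTree` for the .NET classes named above (a substitution made for this example, not the same API). It assumes namespace-free markup; real XHTML carries a namespace that the path would need to account for. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

def paragraph_texts(xhtml):
    """Parse well-formed markup and return the text of every <p> element."""
    root = ET.fromstring(xhtml)
    # ".//p" is an XPath-style path: every <p> descendant of the root.
    return [p.text for p in root.findall(".//p")]

def is_well_formed(xhtml):
    """Report whether a strict XML parser accepts the markup."""
    try:
        ET.fromstring(xhtml)
        return True
    except ET.ParseError:
        return False

print(paragraph_texts("<div><p>one</p><p>two</p></div>"))  # ['one', 'two']
print(is_well_formed("<div><p>one<p>two</p></div>"))       # False
```

A strict XML parser rejects the unclosed <p> outright, so it can tell you that the markup is broken but will not hand back a tree to query.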
Rather than using * for a maximal match, use *? for a minimal one.
Should be able to make a start with
<P>((?!</P>).)*?<P>
This uses a negative lookahead assertion to ensure the end tag is not matched at each point between the "<P>" matches.
EDIT: Corrected to put assertion (thanks to commenter).
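A quick way to try the lazy pattern with its per-character assertion is the Python sketch below. The helper name `first_unclosed_run` and the second sample string are invented for this example; the pattern is used as written, so it is case-sensitive and the dot does not cross newlines.

```python
import re

# The pattern from this answer: lazily step forward one character at a time,
# refusing any position where </P> begins, until the next <P> is reached.
TEMPERED = re.compile(r"<P>((?!</P>).)*?<P>")

def first_unclosed_run(markup):
    """Return the first '<P>...<P>' stretch containing no </P>, or None."""
    match = TEMPERED.search(markup)
    return match.group(0) if match else None

print(first_unclosed_run("<P>(of the western circuit)<P>PREFACE</P>"))
print(first_unclosed_run("<P>closed</P><P>next</P>"))
```

The first call returns the run from the unclosed <P> through the next opening tag; the second returns None because every <P> meets its </P> first. Note that this pattern consumes the second <P>, unlike a version that only looks ahead for it.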