Regex - I only want to match the start tags in regex

Question

I am writing a regular expression that should match only malformed tags, such as a <P> followed by *some text here, some other tags may be here as well but no ending 'p' tag*

 <P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P><P>(of the western circuit)<P>PREFACE</P>

In the sample text above, I want to get (of the western circuit) as the result, and nothing else should be captured. I'm using this, but it's not working:

<P>[^\(</P>\)]*<P>

Please help.

Answer 1:

Regex is not always a good choice for XML/HTML-type data. In particular, attributes, case sensitivity, comments, etc. all have a big impact.

For xhtml, I'd use XmlDocument/XDocument and an xpath query.

For "non-x" HTML, I'd look at the HTML Agility Pack and use the same XPath approach.
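To make the parser route concrete, here is a minimal sketch. It uses Python's standard-library xml.etree.ElementTree purely as a stand-in for the .NET classes named above (that substitution, the wrapping div element, and the variable names are my assumptions, not part of the answer). It shows an XPath-style query on well-formed markup, and that a strict XML parser rejects the unclosed tag instead of locating it.

```python
import xml.etree.ElementTree as ET

# Well-formed XHTML fragment (hypothetical, wrapped in a <div> root).
xhtml = "<div><p>TO </p><p>HENRY BULLAR, </p><p>PREFACE</p></div>"
root = ET.fromstring(xhtml)

# XPath-style query: every <p> element anywhere under the root.
paragraphs = [p.text for p in root.findall(".//p")]
print(paragraphs)

# The same kind of fragment with an unclosed <p>: a strict XML parser
# raises an error rather than returning the offending element.
broken = "<div><p>(of the western circuit)<p>PREFACE</p></div>"
try:
    ET.fromstring(broken)
    parse_error = None
except ET.ParseError as exc:
    parse_error = exc
print(parse_error is not None)
```

Note that ElementTree is case-sensitive about tag names, so the upper-case <P> in the question would need ".//P" instead.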

Answer 2:

Match group one of:

(?:<p>(?:(?!<\/?p>).?)+)(<p>)

matches the second <P> (with case-insensitive matching enabled, since the pattern is written in lower case) in:

<P>(of the western circuit)<P>PREFACE</P>
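As a quick check of the claim above, here is a sketch in Python (my choice of language; the question is tagged .net). The pattern is a lightly tidied variant of the one in this answer, using `.` instead of `.?` inside the repeated group, and the variable names are hypothetical.

```python
import re

fragment = "<P>(of the western circuit)<P>PREFACE</P>"

# Tidied variant of the answer's pattern; IGNORECASE because the
# pattern uses lower-case <p> while the data uses <P>.
pattern = re.compile(r"(?:<p>(?:(?!</?p>).)+)(<p>)", re.IGNORECASE)

m = pattern.search(fragment)
second_tag = m.group(1)        # the captured second opening tag
second_tag_start = m.start(1)  # its offset within the fragment
print(second_tag, second_tag_start)  # <P> 27
```

Group one is the second opening tag itself, so the text the asker wants is what lies between the start of the match and that offset.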

Note: I'm usually one of those that say: "Don't do HTML with regex, use a parser instead". But I don't think the specific problem can be solved with a parser, which would probably just ignore/transparently deal with the invalid markup.

Answer 3:

I know this isn't likely (or even html-legal?) to happen in this case, but a generic unclosed xml-tag solution would be pretty difficult as you need to consider what would happen with nested tags like

<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>

I'm pretty sure the regular expressions given so far would match the second <p> there, even though it is not actually an unclosed <p>.
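A small sketch of that false positive, in Python. The pattern is my own condensed stand-in for the "opening tag, no closing tag, another opening tag" expressions in the other answers, not any one of them verbatim.

```python
import re

nested = "<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>"

# Stand-in pattern: <p>, then text containing no <p> or </p>, then <p>.
pattern = re.compile(r"<p>((?:(?!</?p>).)*)<p>", re.IGNORECASE)

# Text the pattern reports as belonging to an "unclosed" <p>.
flagged = [m.group(1) for m in pattern.finditer(nested)]

# Simple tag counts show the markup is actually balanced.
opens = len(re.findall(r"<p>", nested, re.IGNORECASE))
closes = len(re.findall(r"</p>", nested, re.IGNORECASE))

print(flagged)        # ['OUTER BEFORE']
print(opens, closes)  # 2 2
```

The outer paragraph is reported even though both tags are closed, which is the difficulty described above: a regex of this shape only sees adjacent opening tags and has no notion of nesting.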

Answer 4:

Rather than using * for maximal match, use *? for minimal.

Should be able to make a start with

<P>((?!</P>).)*?<P>

This uses a negative lookahead assertion to ensure the end tag is not matched at each point between the "<P>" matches.

EDIT: Corrected to put assertion (thanks to commenter).
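Here is a sketch in Python comparing the asker's attempt with the lookahead version. The asker's character class [^\(</P>\)] excludes the individual characters (, <, /, P, > and ), not the sequence </P>, so it can never get past the opening parenthesis. In the second pattern I wrapped the whole repeated part in a capturing group so that group 1 holds the text; that wrapping is my addition, not part of the answer.

```python
import re

sample = ("<P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P>"
          "<P>(of the western circuit)<P>PREFACE</P>")

# The asker's pattern: the class rejects single characters, including "(".
attempt = re.compile(r"<P>[^\(</P>\)]*<P>")

# The answer's idea: consume one character at a time, checking before
# each one that a closing </P> does not start at that position.
tempered = re.compile(r"<P>((?:(?!</P>).)*?)<P>")

attempt_hit = attempt.search(sample)
unclosed_text = [m.group(1) for m in tempered.finditer(sample)]

print(attempt_hit)    # None
print(unclosed_text)  # ['(of the western circuit)']
```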

Answer 5:

All of the solutions offered so far match the second <p>, but that's wrong. What if there are two consecutive <p> elements without closing tags? The second one won't be matched because the first match ate its opening tag. You can avoid that problem by using a lookahead as I did here:

@"<p\b(?>(?:[^<]+|<(?!/?p>))*)(?=<p\b|$)"

As for the rest of it, I used a "not the initial or not the rest" technique along with an atomic group to guide the regex to a match as efficiently as possible (and, more importantly, to fail as quickly as possible if it's going to).
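The difference between consuming the next opening tag and merely looking ahead at it can be sketched in Python as follows. Two things here are my assumptions rather than the answer's code: the test string is made up, and the atomic group is dropped (Python's re module only gained (?>...) in 3.11), so the efficiency argument above is not reproduced, only the matching behaviour on this input.

```python
import re

# Hypothetical input: two consecutive unclosed <p> elements, then a closed one.
text = "<p>one<p>two<p>three</p>"

# Consuming style: the trailing <p> is part of the match, so the next
# search starts after it and the second unclosed element is missed.
consuming = re.compile(r"<p>((?:(?!</?p>).)*)<p>", re.IGNORECASE)

# Lookahead style, simplified from the answer: the following <p is only
# asserted, not consumed, so it can begin the next match.
lookahead = re.compile(r"<p\b(?:[^<]|<(?!/?p>))*(?=<p\b|$)", re.IGNORECASE)

consumed_hits = consuming.findall(text)
lookahead_hits = lookahead.findall(text)

print(consumed_hits)   # ['one']
print(lookahead_hits)  # ['<p>one', '<p>two']
```

On the sample from the question, the lookahead version returns the single element "<P>(of the western circuit)".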

Source: https://stackoverflow.com/questions/577210/regex-i-only-want-to-match-the-start-tags-in-regex

Tags

.net

html

regex