Regex - I only want to match the start tags in regex

Posted by 雨燕双飞 on 2019-12-19 04:15:43

Question


I am writing a regex that should match only malformed tags, i.e. a <p> followed by some text (other tags may appear there as well) that runs into the next <p> without a closing </p>. Sample text:

 <P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P><P>(of the western circuit)<P>PREFACE</P>

In the sample text above, I want the result to be <P>(of the western circuit)<P>, and nothing else should be captured. I'm using the following, but it's not working:

<P>[^\(</P>\)]*<P>

Please help.
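One likely reason the attempt fails: square brackets form a character class, so [^\(</P>\)] does not mean "anything except the sequence </P>". It means "any single character other than (, <, /, P, > or )". The sketch below checks this against the sample text. Python's re module is used purely for illustration (the thread itself appears to target .NET), and the variable names are made up for this example.

```python
import re

sample = ("<P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P>"
          "<P>(of the western circuit)<P>PREFACE</P>")

# The pattern from the question, unchanged.
attempt = re.compile(r"<P>[^\(</P>\)]*<P>")

# The negated class rejects each listed character individually,
# so the "(" right after the fourth <P> already stops the match.
excluded = set("(</P>)")
wanted_text = "(of the western circuit)"
blocked = sorted(ch for ch in set(wanted_text) if ch in excluded)

result = attempt.search(sample)
print(result)    # no match object is returned for this sample
print(blocked)   # characters of the wanted text that the class forbids
```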


Answer 1:


Regex is not always a good choice for XML/HTML data. In particular, attributes, case sensitivity, comments, and so on all complicate matching.

For XHTML, I'd use XmlDocument/XDocument and an XPath query.

For "non-x" HTML, I'd look at the HTML Agility Pack, again with an XPath query.
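As a rough illustration of the parser route, here is a sketch that swaps in Python's standard-library html.parser for the .NET libraries named above. It is event-based and does not repair markup, so start and end tags can be observed as they appear. The class name UnclosedPFinder, the helper find_unclosed_p, and the rule "a <p> that starts while another <p> is still open marks the earlier one as unclosed" are assumptions made for this example, not something taken from the answer.

```python
from html.parser import HTMLParser


class UnclosedPFinder(HTMLParser):
    """Hypothetical helper: records <p> tags that never see a </p>."""

    def __init__(self):
        super().__init__()
        self.open_p_pos = None  # position reported for the <p> currently open
        self.unclosed = []      # positions of <p> tags judged unclosed

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <P> arrives as "p".
        if tag != "p":
            return
        if self.open_p_pos is not None:
            # A new <p> began before the previous one was closed.
            self.unclosed.append(self.open_p_pos)
        self.open_p_pos = self.getpos()

    def handle_endtag(self, tag):
        if tag == "p":
            self.open_p_pos = None

    def close(self):
        super().close()
        # A <p> still open at end of input is also unclosed.
        if self.open_p_pos is not None:
            self.unclosed.append(self.open_p_pos)
            self.open_p_pos = None


def find_unclosed_p(html):
    finder = UnclosedPFinder()
    finder.feed(html)
    finder.close()
    return finder.unclosed


sample = ("<P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P>"
          "<P>(of the western circuit)<P>PREFACE</P>")
print(find_unclosed_p(sample))
```

Whether a tree-building parser such as the ones named in this answer exposes the same information depends on how it handles invalid markup.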




Answer 2:


With case-insensitive matching enabled, match group one of:

(?:<p>(?:(?!<\/?p>).?)+)(<p>)

matches the second <p> in:

<P>(of the western circuit)<P>PREFACE</P>
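A quick way to try this pattern is sketched below in Python (an assumption; the answer does not name a regex engine). The pattern is lowercase while the sample is uppercase, so the sketch passes re.IGNORECASE; the variable names are illustrative only.

```python
import re

sample = ("<P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P>"
          "<P>(of the western circuit)<P>PREFACE</P>")

# Pattern copied from the answer. The (?!<\/?p>) lookahead stops the
# scan at any opening or closing p tag; group 1 is the second <p>.
pattern = re.compile(r"(?:<p>(?:(?!<\/?p>).?)+)(<p>)", re.IGNORECASE)

m = pattern.search(sample)
whole_match = m.group(0)   # text from the unclosed <P> through the next <P>
second_tag = m.group(1)    # only the second <P>
print(whole_match)
print(second_tag, "at index", m.start(1))
```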

Note: I'm usually one of those that say: "Don't do HTML with regex, use a parser instead". But I don't think the specific problem can be solved with a parser, which would probably just ignore/transparently deal with the invalid markup.




Answer 3:


I know this isn't likely (or even html-legal?) to happen in this case, but a generic unclosed xml-tag solution would be pretty difficult as you need to consider what would happen with nested tags like

<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>

I'm pretty sure the regular expressions given so far would match the second <p> there, even though it is not actually an unclosed <p>.
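The concern can be checked directly. The sketch below runs two of the patterns from this thread against the nested example, using Python's re module with re.IGNORECASE (both choices are assumptions made for illustration, as are the dictionary labels).

```python
import re

nested = "<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>"

# Patterns copied from other answers in this thread.
patterns = {
    "tempered dot, group on second tag": r"(?:<p>(?:(?!<\/?p>).?)+)(<p>)",
    "lazy with lookahead": r"<P>((?!</P>).)*?<P>",
}

hits = {}
for label, pattern in patterns.items():
    m = re.search(pattern, nested, re.IGNORECASE)
    hits[label] = m.group(0) if m else None

# Both patterns report the outer <p> as unclosed, although it is
# closed by the final </p> in the nested example.
for label, text in hits.items():
    print(label, "->", text)
```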




Answer 4:


Rather than using * for maximal match, use *? for minimal.

You should be able to make a start with

<P>((?!</P>).)*?<P>

This uses a negative lookahead assertion to ensure the end tag is not matched at each point between the "<P>" matches.
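To see what the *? buys, the sketch below compares the greedy and lazy forms of this pattern on a short made-up input with two unclosed tags in a row. Python's re module and the input string are assumptions chosen for illustration.

```python
import re

text = "<P>a<P>b<P>c</P>"

# Same lookahead in both; only the quantifier differs.
greedy = re.compile(r"<P>((?!</P>).)*<P>")
lazy = re.compile(r"<P>((?!</P>).)*?<P>")

greedy_match = greedy.search(text).group(0)
lazy_match = lazy.search(text).group(0)

# The greedy form runs on to the last <P> it can reach before </P>;
# the lazy form stops at the first <P> that follows.
print(greedy_match)
print(lazy_match)
```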

EDIT: Corrected the placement of the assertion (thanks to a commenter).




Answer 5:


All of the solutions offered so far match the second <P>, but that's wrong. What if there are two consecutive <P> elements without closing tags? The second one won't be matched because the first match ate its opening tag. You can avoid that problem by using a lookahead as I did here:

@"<p\b(?>(?:[^<]+|<(?!/?p>))*)(?=<p\b|$)"

As for the rest of it, I used a "not the initial or not the rest" technique along with an atomic group to guide the regex to a match as efficiently as possible (and, more importantly, to fail as quickly as possible if it's going to).
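The effect of the lookahead can be sketched in Python. Python's re module only gained atomic groups in version 3.11, so this example does not use the pattern above; it swaps in a simpler tempered-dot body and keeps only the idea under discussion, namely ending the match with (?=<p>|$) instead of consuming the next tag. The input string and variable names are made up for illustration.

```python
import re

text = "<P>one<P>two<P>three</P>"

# Consuming variant: the next <P> is part of the match, so the search
# resumes after it and the second unclosed tag is never examined.
consuming = re.compile(r"<p>(?:(?!</p>).)*?<p>", re.IGNORECASE | re.DOTALL)

# Lookahead variant: the next <P> is only peeked at, so it is still
# available as the start of the following match.
peeking = re.compile(r"<p>(?:(?!</?p>).)*(?=<p>|$)", re.IGNORECASE | re.DOTALL)

consumed = [m.group(0) for m in consuming.finditer(text)]
peeked = [m.group(0) for m in peeking.finditer(text)]

print(consumed)
print(peeked)
```

The simplified body gives up the efficiency argument made for the atomic group; it is only meant to show why the second unclosed tag is missed when the trailing <P> is consumed.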



Source: https://stackoverflow.com/questions/577210/regex-i-only-want-to-match-the-start-tags-in-regex
