Regex - I only want to match the start tags in regex

醉话见心 2021-01-06 23:36

I am writing a regex that should match only malformed tags like:

*some text here, some other tags may be here as well but no ending 'p' tag*

5 answers
  • 2021-01-07 00:08

    All of the solutions offered so far match the second <P>, but that's wrong. What if there are two consecutive <P> elements without closing tags? The second one won't be matched because the first match ate its opening tag. You can avoid that problem by using a lookahead as I did here:

    @"<p\b(?>(?:[^<]+|<(?!/?p>))*)(?=<p\b|$)"
    

    As for the rest of it, I used a "not the initial or not the rest" technique along with an atomic group to guide the regex to a match as efficiently as possible (and, more importantly, to fail as quickly as possible if it's going to).
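A minimal Python sketch of the lookahead idea above, not the answer's exact pattern. Two assumptions are mine: the atomic group `(?>...)` is replaced by a one-character-per-step alternation so the pattern also runs on Python versions before 3.11, and the inner lookahead uses `\b` instead of `>` so that `<p class="...">` also counts as a paragraph start. The function name `unclosed_p_offsets` is hypothetical.

```python
import re

# Adapted pattern (assumptions noted in the text above):
#   <p\b                    an opening <p, with or without attributes
#   (?:[^<]|<(?!/?p\b))*    any text, or a "<" that does not begin <p or </p
#   (?=<p\b|$)              must run into another <p or the end of input,
#                           checked by lookahead so the next <p is not consumed
UNCLOSED_P = re.compile(r"<p\b(?:[^<]|<(?!/?p\b))*(?=<p\b|$)", re.IGNORECASE)


def unclosed_p_offsets(html):
    """Return the start offset of every <p> that is not followed by its own </p>."""
    return [m.start() for m in UNCLOSED_P.finditer(html)]


# Two consecutive unclosed <p> elements are both reported, because the
# lookahead leaves the second opening tag available for the next match.
print(unclosed_p_offsets("<p>one<p>two<p>three</p>"))  # [0, 6]
```

Because the next `<p` is only asserted and never consumed, `finditer` can start the following match exactly on it.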

  • 2021-01-07 00:17

    I know this isn't likely (or even HTML-legal?) to happen in this case, but a generic solution for unclosed XML tags would be pretty difficult, because you need to consider what happens with nested tags like:

    <p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>
    

    I'm pretty sure the regular expressions given so far would match the second <p> there, even though it is not actually an unclosed <p>.
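The suspicion above can be checked quickly. This sketch runs two of the patterns posted in this thread against the nested example, using Python's `re` module with case-insensitive matching (the flag and the helper name `first_match` are my assumptions, not part of the original answers).

```python
import re

# Patterns copied from the other answers in this thread.
GROUP_ONE = re.compile(r"(?:<p>(?:(?!<\/?p>).?)+)(<p>)", re.IGNORECASE)
LAZY = re.compile(r"<P>((?!</P>).)*?<P>", re.IGNORECASE)

NESTED = "<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>"


def first_match(pattern, text):
    """Return the text of the first match, or None if the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None


for name, pattern in (("group-one", GROUP_ONE), ("lazy", LAZY)):
    # Both patterns stop at the inner <p>, although every tag here is closed.
    print(name, repr(first_match(pattern, NESTED)))
```

Both patterns report a match ending at the inner `<p>`, which is a false positive if nesting is considered legal.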

  • 2021-01-07 00:21

    Match group one of:

    (?:<p>(?:(?!<\/?p>).?)+)(<p>)
    

    matches the second <p> in:

    <P>(of the western circuit)<P>PREFACE</P>
    

    Note: I'm usually one of those who say: "Don't do HTML with regex, use a parser instead". But I don't think this specific problem can be solved with a parser, which would probably just ignore or transparently repair the invalid markup.

  • 2021-01-07 00:27

    Regex is not always a good choice for XML/HTML data. In particular, attributes, case sensitivity, comments, etc. all have a big impact.

    For XHTML, I'd use XmlDocument/XDocument and an XPath query.

    For "non-X" HTML, I'd look at the HTML Agility Pack, again with an XPath query.
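As an illustration of the parser route, here is a sketch that uses Python's standard `html.parser` as a stand-in. It is not the .NET XmlDocument or HTML Agility Pack API named above. `HTMLParser` only emits tag events and does not repair markup, so unbalanced `<p>` tags stay visible. The class and function names are hypothetical, and the stack logic assumes XML-style nesting, which is my choice rather than something the answer specifies.

```python
from html.parser import HTMLParser


class UnclosedPFinder(HTMLParser):
    """Track <p> start tags that have not yet been closed by a </p>."""

    def __init__(self):
        super().__init__()
        # Stack of (line, column) positions of currently open <p> tags.
        self.open_positions = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <P> and <p> are treated alike.
        if tag == "p":
            self.open_positions.append(self.getpos())

    def handle_endtag(self, tag):
        # A </p> closes the most recently opened <p> (XML-style nesting).
        if tag == "p" and self.open_positions:
            self.open_positions.pop()


def find_unclosed_p(html):
    """Return (line, column) for each <p> still open at the end of the input."""
    finder = UnclosedPFinder()
    finder.feed(html)
    finder.close()
    return list(finder.open_positions)


print(find_unclosed_p("<P>(of the western circuit)<P>PREFACE</P>"))
```

Under the nesting assumption the first `<P>` is the one reported as unclosed, since the only `</P>` is paired with the nearest open tag. Attributes, comments, and letter case are handled by the parser rather than by the pattern.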

  • 2021-01-07 00:30

    Rather than using * for maximal match, use *? for minimal.

    You should be able to make a start with:

    <P>((?!</P>).)*?<P>
    

    This uses a negative lookahead assertion to ensure the end tag is not matched at each point between the "<P>" matches.

    EDIT: Corrected the assertion (thanks to the commenter).
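A short Python sketch of the lazy pattern above. The `re.IGNORECASE` and `re.DOTALL` flags are my assumptions (so lowercase tags and multi-line paragraphs are covered), and `first_unclosed_span` is a hypothetical helper name.

```python
import re

# The pattern from this answer: each character between the two <P> tags is
# consumed only if a </P> does not start at that position.
LAZY = re.compile(r"<P>((?!</P>).)*?<P>", re.IGNORECASE | re.DOTALL)


def first_unclosed_span(html):
    """Return the first '<P>...<P>' span with no </P> in between, or None."""
    m = LAZY.search(html)
    return m.group(0) if m else None


print(first_unclosed_span("<P>(of the western circuit)<P>PREFACE</P>"))
print(first_unclosed_span("<P>a</P><P>b</P>"))  # properly closed: no match
```

Note that the match consumes the second `<P>`, so a run of several unclosed paragraphs would need a lookahead for the trailing tag instead.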
