Regex to match all HTML tags except <p> and </p>

抹茶落季 2020-11-30 06:31

I need to match and remove all tags using a regular expression in Perl. I have the following:

<\\\\??(?!p).+?>

But this still matches the closing </p> tag.

13 Answers
  • 2020-11-30 07:09

    The original regex can be made to work with very little effort:

     <(?>/?)(?!p).+?>
    

    The problem was that the optional /? (or \?) gave back the slash it had matched when the assertion after it failed, so the (?!p) lookahead was retried in front of the slash and succeeded. Wrapping it in a non-backtracking (atomic) group (?>...) means it never releases the matched slash, so the (?!p) assertion is always tested against the start of the tag name.

    (That said, I agree that parsing HTML with regexes is generally not the way to go.)
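
    A minimal sketch of the difference, assuming the question's pattern was meant to read </?(?!p).+?> and using a made-up sample string ($html, $naive and $atomic are illustrative names, not from the original post):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input, chosen only to show the backtracking effect.
my $html = '<div><p>Keep me</p> <b>bold</b></div>';

# Plain optional slash: on "</p>" the /? first takes the slash, (?!p) fails,
# then /? backtracks to match nothing and .+? swallows "/p" instead.
(my $naive = $html) =~ s{</?(?!p).+?>}{}g;

# Atomic group: once (?>/?) has taken the slash it cannot give it back,
# so (?!p) is only ever tested directly in front of the tag name.
(my $atomic = $html) =~ s{<(?>/?)(?!p).+?>}{}g;

print "naive:  $naive\n";   # naive:  <p>Keep me bold
print "atomic: $atomic\n";  # atomic: <p>Keep me</p> bold
```

    Note that (?!p) only looks at the first letter of the tag name, so this version also leaves tags such as <pre> in place.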

  • 2020-11-30 07:11

    Xetius, resurrecting this ancient question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)

    With all the disclaimers about using regex to parse HTML, here is a simple way to do it.

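    #!/usr/bin/perl
    use strict;
    use warnings;
    # Group 1 captures the <p> tags to keep (\b stops it from also keeping <pre>); the other branch matches any other tag.
    my $regex = '(<\/?p\b[^>]*>)|<[^>]*>';
    my $subject = 'Bad html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p>';
    # Put group 1 back when it matched; otherwise replace the tag with nothing.
    (my $replaced = $subject) =~ s{$regex}{$1 // ''}eg;
    print $replaced . "\n";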
    

    See this live demo

    Reference

    How to match pattern except in situations s1, s2, s3

    How to match a pattern unless...

  • 2020-11-30 07:15

    In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (which is one of the major reasons XHTML, a much simpler language, was created).

    For example, this:

    <HTML /
      <HEAD /
        <TITLE / > /
        <P / >
    

    is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)

    It is semantically equivalent to

    <html>
      <head>
        <title>
          &gt;
        </title>
      </head>
      <body>
        <p>
          &gt;
        </p>
      </body>
    </html>
    

    But it's nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others already suggested, using an actual HTML parser is just sooo much easier.

  • 2020-11-30 07:15

    You also might want to allow for whitespace before the "p" in the p tag. Not sure how often you'll run into this, but < p> is perfectly valid HTML.

  • 2020-11-30 07:16

    I came up with this:

    <(?!\/?p(?=>|\s.*>))\/?.*?>
    
    m{
    <           # Match open angle bracket
    (?!         # Negative lookahead (not matching and not consuming)
        \/?     # 0 or 1 /
        p       # p
        (?=     # Positive lookahead (matching and not consuming)
            >   # > - no attributes
          |     # or
            \s  # whitespace
            .*  # anything up to
            >   # close angle bracket - with attributes
        )       # close positive lookahead
    )           # close negative lookahead
                # if we have got this far then the tag is not
                # a p tag or closing p tag,
                # with or without attributes
    \/?         # optional close tag symbol (/)
    .*?         # and anything up to
    >           # first closing angle bracket
    }x
    

    This now skips p tags with or without attributes, as well as closing p tags, but still matches pre and other tags whose names merely start with p, with or without attributes.

    It doesn't strip attributes from the p tags it keeps, but my source data doesn't contain any. I may add that later, but this will suffice for now.
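
    To see which tags the pattern above removes and which it keeps, here is a small sketch run against a made-up input (the sample string and the names $not_p, $html and $stripped are assumptions for illustration only):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The one-line pattern from this answer, compiled once.
my $not_p = qr{<(?!/?p(?=>|\s.*>))/?.*?>};

# Hypothetical input: a <p> with an attribute, a <pre>, and a self-closed <br/>.
my $html = '<div><p class="x">One</p><pre>two</pre><br/></div>';

# Delete every tag the pattern matches; <p ...> and </p> are left alone.
(my $stripped = $html) =~ s/$not_p//g;

print "$stripped\n";   # <p class="x">One</p>two
```

    Because . does not match a newline by default, a tag that spans several lines would not be matched by this sketch; adding the /s modifier is one way to handle that.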

  • 2020-11-30 07:20

    You should probably also remove any attributes on the <p> tag, since someone bad could do something like:

    <p onclick="document.location.href='http://www.evil.com'">Clickable text</p>
    

    The easiest way to do this is to use the regex people suggest here to search for <p> tags with attributes, and replace them with <p> tags without attributes. Just to be on the safe side.
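
    One possible way to do that replacement, as a rough sketch rather than a vetted sanitizer (the pattern and the variable names $html and $clean are my own assumptions, not taken from the other answers):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The example tag from above, held in a q{} string so both quote styles survive.
my $html = q{<p onclick="document.location.href='http://www.evil.com'">Clickable text</p>};

# Match "<p", optionally followed by whitespace and attributes, up to the next ">",
# and replace the whole opening tag with a bare <p>. The \s requirement keeps
# tags such as <pre> from being touched.
(my $clean = $html) =~ s{<p(?:\s[^>]*)?>}{<p>}gi;

print "$clean\n";   # <p>Clickable text</p>
```

    This assumes no attribute value contains a literal > character; input like that would need a real HTML parser.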
