Regex to match all HTML tags except <p> and </p>


I need to match and remove all tags using a regular expression in Perl. I have the following:

<\\\\??(?!p).+?>

But this still matches the closing </p> tag.


13 Answers

旧巷少年郎

2020-11-30 07:09
The original regex can be made to work with very little effort:
```
 <(?>/?)(?!p).+?>
```
The problem was that the /? (or \?) gave up what it had matched when the assertion after it failed. Wrapping it in an atomic (non-backtracking) group (?>...) ensures that it never releases the matched slash, so the (?!p) assertion is always anchored to the start of the tag text.

(That said, I agree that parsing HTML with regexes is generally not the way to go.)
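The difference between the two patterns can be checked with a short sketch. The sample string and variable names here are illustrative assumptions, not part of the original question.

```perl
use strict;
use warnings;

# Original pattern: the optional slash can be given back on backtracking.
my $plain  = qr{<\/?(?!p).+?>};
# Atomic group: once the slash is matched it is never released.
my $atomic = qr{<(?>\/?)(?!p).+?>};

# Hypothetical sample input.
my $html = '<div><p>Hello <b>world</b></p></div>';

(my $with_plain  = $html) =~ s/$plain//g;
(my $with_atomic = $html) =~ s/$atomic//g;

print "$with_plain\n";   # the closing </p> is removed as well
print "$with_atomic\n";  # both <p> and </p> survive
```

Note that both patterns still skip any tag whose name merely starts with "p" (for example pre), because (?!p) only looks at the first letter.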
佛祖请我去吃肉

2020-11-30 07:11
Xetius, resurrecting this ancient question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)

With all the disclaimers about using regex to parse HTML, here is a simple way to do it.
```
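#!/usr/bin/perl
use strict;
use warnings;
# Group 1 captures <p ...> and </p>; the second alternative matches any other tag.
# \b stops the first alternative from also capturing <pre>, <param> and the like.
my $regex = '(<\/?p\b[^>]*>)|<[^>]*>';
my $subject = 'Bad html <a> My paragraph Italics second';
# Put group 1 back when a p tag matched; otherwise replace the tag with nothing.
(my $replaced = $subject) =~ s{$regex}{$1 // ''}eg;
print $replaced . "\n";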
```
See this live demo

Reference

How to match pattern except in situations s1, s2, s3

How to match a pattern unless...
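As a rough sketch of how the same "capture what you keep, match everything else" idea could be generalized, the helper below takes a list of tag names to preserve. The name `strip_tags_except` and the sample input are hypothetical, and the usual caveats about regexes and HTML still apply.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every tag except those named in @keep.
sub strip_tags_except {
    my ($html, @keep) = @_;
    my $names = join '|', map { quotemeta($_) } @keep;
    # Group 1: an opening or closing tag whose name is in the keep list.
    # \b prevents "p" from also matching "pre" or "param".
    my $regex = qr{(</?(?:$names)\b[^>]*>)|<[^>]*>}i;
    # Put group 1 back when it took part in the match; drop everything else.
    (my $out = $html) =~ s{$regex}{ defined $1 ? $1 : '' }eg;
    return $out;
}

# Illustrative input, not taken from the original answer.
my $sample = '<div><p class="x">One</p><pre>two</pre><i>three</i></div>';
my $result = strip_tags_except($sample, 'p');
print "$result\n";
```

The `[^>]*` part assumes that no attribute value contains a literal `>`, which real-world HTML does not guarantee.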
無奈伤痛

2020-11-30 07:15
In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (one of the major reasons that XHTML, which is much simpler, was created).

For example, this:
```
<HTML /
  <HEAD /
    <TITLE / > /
    <P / >
```
is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)

It is semantically equivalent to
```
<html>
  <head>
    <title>
      &gt;
    </title>
  </head>
  <body>
    <p>
      &gt;
    </p>
  </body>
</html>
```
But it's nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others already suggested, using an actual HTML parser is just sooo much easier.
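For comparison, here is a minimal sketch of the parser-based route. It assumes the CPAN module HTML::Parser is installed (it is not part of core Perl); the sample input and variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::Parser;   # CPAN module, assumed to be installed

my $out = '';

# Copy text through unchanged and copy a tag only when its name is "p".
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { my ($tag, $text) = @_; $out .= $text if $tag eq 'p' },
                 'tagname, text' ],
    end_h   => [ sub { my ($tag, $text) = @_; $out .= $text if $tag eq 'p' },
                 'tagname, text' ],
    text_h  => [ sub { $out .= $_[0] }, 'text' ],
);

# Hypothetical sample input.
$parser->parse('<div><p>Hi <b>there</b></p></div>');
$parser->eof;

print "$out\n";
```

Because no handler is registered for comments or declarations, those are simply dropped in this sketch.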
日久生厌

2020-11-30 07:15

You also might want to allow for whitespace before the "p" in the p tag. Not sure how often you'll run into this, but < p> is perfectly valid HTML.


独厮守ぢ

2020-11-30 07:16

I came up with this:

```
<(?!\/?p(?=>|\s.*>))\/?.*?>
```
```
x/
<           # Match open angle bracket
(?!         # Negative lookahead (Not matching and not consuming)
    \/?     # 0 or 1 /
    p       # p
    (?=     # Positive lookahead (Matching and not consuming)
    >       # > - No attributes
    |       # or
    \s      # whitespace
    .*      # anything up to
    >       # close angle brackets - with attributes
    )       # close positive lookahead
)           # close negative lookahead
            # if we have got this far then we don't match
            # a p tag or closing p tag
            # with or without attributes
\/?         # optional close tag symbol (/)
.*?         # and anything up to
>           # first closing tag
/
```

This now leaves p tags (with or without attributes) and closing p tags alone, but still matches pre and similar tags, with or without attributes.

It doesn't strip out attributes, but my source data does not put them in. I may change this later to do this, but this will suffice for now.
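A quick way to try the pattern is a substitution over a small sample. The input string and variable names below are assumptions made for illustration.

```perl
use strict;
use warnings;

# The one-line pattern from this answer, compiled once.
my $not_p = qr{<(?!\/?p(?=>|\s.*>))\/?.*?>};

# Hypothetical sample: a p tag with an attribute next to a pre tag.
my $input = '<div><p class="x">One</p><pre>two</pre></div>';
(my $cleaned = $input) =~ s/$not_p//g;

print "$cleaned\n";  # p tags remain, div and pre tags are gone
```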


盖世英雄少女心

2020-11-30 07:20
You should probably also remove any attributes on the tag, since someone bad could do something like:
```
Clickable text
```
The easiest way to do this is to use the regex people suggest here to search for <p> tags with attributes, and replace them with <p> tags without attributes. Just to be on the safe side.
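A minimal sketch of that replacement step follows. The onclick attribute in the sample is an illustrative assumption, not the example from the original answer.

```perl
use strict;
use warnings;

# Hypothetical input: the onclick attribute is made up for illustration.
my $unsafe = q{<p onclick="alert('x')">Clickable text</p>};

# Replace any opening p tag, with or without attributes, by a bare <p>.
# \b keeps the pattern from touching <pre> and similar tags.
(my $safe = $unsafe) =~ s{<p\b[^>]*>}{<p>}gi;

print "$safe\n";
```

This relies on attribute values never containing a literal `>`; input that does would need a real HTML parser.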
