How do I filter all HTML tags except a certain whitelist?

一整个雨季

This is for .NET. IgnoreCase is set and MultiLine is NOT set.

Usually I'm decent at regex, maybe I'm running low on caffeine...

Users are allowed to enter

8 Answers

盖世英雄少女心

2020-11-27 12:35
Here's a function I wrote for this task:
```
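// Requires: using System.Text.RegularExpressions;
static string SanitizeHtml(string html)
{
    // Tag names to leave untouched, separated by | so that any one of them can match.
    string acceptable = "script|link|title";
    // If the tag name is acceptable, demand the literal "notag" (which cannot match there);
    // otherwise match the whole tag, including any attributes.
    string stringPattern = @"</?(?(?=" + acceptable + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>";
    return Regex.Replace(html, stringPattern, "sausage");
}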
```
Edit: For some reason I posted a correction to my previous answer as a separate answer, so I am consolidating them here.

I will explain the regex a bit, because it is a little long.

The first part matches an open bracket and 0 or 1 slashes (in case it's a close tag).

Next you see an if-then construct with a lookahead: (?(?=SomeTag)then|else). I am checking whether the next part of the string is one of the acceptable tags. You can see that I concatenate the regex string with the acceptable variable, which holds the acceptable tag names separated by a vertical bar so that any of the terms will match. If it is a match, I put in the word "notag", because no tag would match that, and if the tag is acceptable I want to leave it alone. Otherwise I move on to the else part, where I match any tag name: [a-zA-Z0-9]+

Next, I want to match 0 or more attributes, which I assume are in the form attribute="value". So I group the part representing an attribute, but I use ?: to prevent this group from being captured, for speed: (?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?) (the doubled "" is just the C# verbatim-string escape for a single double-quote character).

Here I begin with the whitespace character that sits between the tag name and the attribute name, then match an attribute name: [a-zA-Z0-9\-]+

Next I match an equals sign, and then either kind of quote. I group the quote so it will be captured, and I can use the backreference \1 later to match the same type of quote. In between these two quotes, I use the period to match anything; however, I use the lazy version *? instead of the greedy version * so that it will only match up to the next quote that would end this value.

Next we put a * after closing the groups with parentheses so that it will match multiple attribute/value combinations (or none). Last, we match optional whitespace with \s*, and 0 or 1 trailing slashes for XML-style self-closing tags.

You can see I'm replacing the tags with sausage, because I'm hungry, but you could replace them with empty string too to just clear them out.
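
The if-then-else construct described above can be tried in isolation. The sketch below is a reduced form of the pattern that leaves out the attribute handling; the class name `ConditionalDemo` and the method `Strip` are made up for illustration, and the whitelist is simply the one from the function above.

```csharp
using System.Text.RegularExpressions;

static class ConditionalDemo
{
    // Whitelist from the answer above; tags named here are left alone.
    const string Acceptable = "script|link|title";

    // Reduced pattern: "<", an optional "/", then the conditional.
    // If the lookahead sees an acceptable name, the "then" branch demands the
    // literal "notag", which cannot match at that position, so the tag survives.
    // Otherwise the "else" branch consumes the tag name and the tag is replaced.
    static readonly Regex TagPattern = new Regex(
        @"</?(?(?=" + Acceptable + @")notag|[a-zA-Z0-9]+)\s*/?>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: removes every attribute-free tag not in the whitelist.
    public static string Strip(string html)
    {
        return TagPattern.Replace(html, "");
    }
}
```

With this sketch, `ConditionalDemo.Strip("<b>bold</b><title>t</title>")` returns `bold<title>t</title>`. Note that the lookahead has no word boundary, so a tag whose name merely starts with a whitelisted name (for example `<titles>`) is also left alone.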

清歌不尽

2020-11-27 12:42

This is a good working example on html tag filtering:

Sanitize HTML


时光取名叫无心

2020-11-27 12:43

    // Requires: using System.Linq; using System.Text.RegularExpressions;
    /// <summary>
    /// Strips HTML tags from the text, except for the specified tags to ignore
    /// </summary>
    /// <param name="text">the text from which html is to be removed</param>
    /// <param name="isRemoveScript">specify if you want to remove scripts</param>
    /// <param name="ignorableTags">specify the tags that are to be ignored while stripping</param>
    /// <returns>Stripped Text</returns>
    public static string StripHtml(string text, bool isRemoveScript, params string[] ignorableTags)
    {
        if (!string.IsNullOrEmpty(text))
        {
            text = text.Replace("&lt;", "<");
            text = text.Replace("&gt;", ">");
            string ignorePattern = null;

            if (isRemoveScript)
            {
                text = Regex.Replace(text, "<script[^<]*</script>", string.Empty, RegexOptions.IgnoreCase);
            }
            if (!ignorableTags.Contains("style"))
            {
                text = Regex.Replace(text, "<style[^<]*</style>", string.Empty, RegexOptions.IgnoreCase);
            }
            foreach (string tag in ignorableTags)
            {
                //Create ignore pattern for the tags to ignore; the trailing \b keeps a
                //short name such as "b" from also protecting <br> or <body>
                ignorePattern = string.Format(@"{0}(?!/?{1}\b)", ignorePattern, Regex.Escape(tag));
            }
            //finally add the ignore pattern into regex <[^<]*> which is used to match all html tags
            ignorePattern = string.Format(@"<{0}[^<]*>", ignorePattern);
            text = Regex.Replace(text, ignorePattern, "", RegexOptions.IgnoreCase);
        }

        return text;
    }


南旧

2020-11-27 12:45

Attributes are the major problem with using regexes to try to work with HTML. Consider the sheer number of potential attributes, and the fact that most of them are optional, and also the fact that they can appear in any order, and the fact that ">" is a legal character in quoted attribute values. When you start trying to take all of that into account, the regex you'd need to deal with it all will quickly become unmanageable.

What I would do instead is use an event-based HTML parser, or one that gives you a DOM tree that you can walk through.
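
To make the point about ">" inside quoted attribute values concrete, here is a small sketch with a deliberately naive tag pattern. `QuotedGtDemo` and `NaiveStrip` are illustrative names only, and the pattern is an assumption chosen to show the failure, not a recommended approach.

```csharp
using System.Text.RegularExpressions;

static class QuotedGtDemo
{
    // Naive idea: a tag is "<", anything that is not ">", then ">".
    // The pattern knows nothing about quotes, so it stops at the first ">"
    // even when that character sits inside a quoted attribute value.
    public static string NaiveStrip(string html)
    {
        return Regex.Replace(html, "<[^>]+>", "");
    }
}
```

Calling `QuotedGtDemo.NaiveStrip("<img alt=\"a > b\" src=\"x\">text")` returns ` b" src="x">text`: the match ends at the ">" inside the alt value, and the rest of the tag leaks into the output.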

野的像风

2020-11-27 12:46
The reason that adding the word boundary \b didn't work is that you didn't put it inside the lookahead. Thus, \b will be attempted after < where it will always match if the < starts an HTML tag.

Put it inside the lookahead like this:
```
<(?!/?(i|b|h3|h4|a|img)\b)[^>]+>
```
This also shows how you can put the / before the list of tags, rather than with each tag.
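
As a quick check of the pattern above, here is a minimal sketch that applies it with IgnoreCase, as stated in the question. The class name `LookaheadDemo` and the method `Strip` are hypothetical names for this example.

```csharp
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // Pattern from the answer: the \b sits inside the negative lookahead,
    // directly after the list of allowed tag names, so "b" no longer
    // shields longer names such as "br" or "body".
    static readonly Regex Disallowed = new Regex(
        @"<(?!/?(i|b|h3|h4|a|img)\b)[^>]+>",
        RegexOptions.IgnoreCase);

    // Removes every tag whose name is not in the allowed list.
    public static string Strip(string html)
    {
        return Disallowed.Replace(html, "");
    }
}
```

For example, `LookaheadDemo.Strip("<b>x</b><br><body>y</body><i>z</i>")` returns `<b>x</b>y<i>z</i>`: the `<b>` and `<i>` tags stay, while `<br>` and `<body>` are removed.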

灰色年华

2020-11-27 12:46
I think I originally intended to make the values optional but didn't follow through, as I can see that I added a ? after the equals sign and grouped the value portion of the match. Let's add a ? after that group (marked with a caret) to make it optional in the match as well. I'm not at my compiler right now, but see if this works:
```
@"</?(?(?=" + acceptable + @")notag|[a-z,A-Z,0-9]+)(?:\s[a-z,A-Z,0-9,\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>";
                                                                                             ^
```