Remove all empty HTML tags?

你离开我真会死。 submitted on 2019-11-29 05:20:57
ridgerunner

First, note that empty HTML elements are, by definition, not nested.

Update: The solution below now applies the empty-element regex repeatedly, so that "nested-empty-element" structures such as <p><strong></strong></p> are also removed (subject to the caveats stated below).

Simple version:

This works pretty well (see caveats below) for HTML whose start tag attributes contain no < or > characters. Here it is as an (untested) VB.NET snippet:

Dim RegexObj As New Regex("<(\w+)\b[^>]*>\s*</\1\s*>")
Do While RegexObj.IsMatch(html)
    html = RegexObj.Replace(html, "")
Loop

Enhanced version:

<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>

Here is the uncommented enhanced version in VB.NET (untested):

Dim RegexObj As New Regex("<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:""[^""]*""|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>")
Do While RegexObj.IsMatch(html)
    html = RegexObj.Replace(html, "")
Loop

This more complex regex correctly matches a valid empty HTML 4.01 element even if it has angle brackets in its attribute values (subject, once again, to the caveats below). In other words, it correctly handles all start tag attribute values, whether quoted (which may contain < and >), unquoted (which may not) or empty. Here is a fully commented (and tested) PHP version:

function strip_empty_tags($text) {
    // Match empty elements (attribute values may have angle brackets).
    $re = '%
        # Regex to match an empty HTML 4.01 Transitional element.
        <                    # Opening tag opening "<" delimiter.
        (\w+)\b              # $1 Tag name.
        (?:                  # Non-capture group for optional attribute(s).
          \s+                # Attributes must be separated by whitespace.
          [\w\-.:]+          # Attribute name is required for attr=value pair.
          (?:                # Non-capture group for optional attribute value.
            \s*=\s*          # Name and value separated by "=" and optional ws.
            (?:              # Non-capture group for attrib value alternatives.
              "[^"]*"        # Double quoted string.
            | \'[^\']*\'     # Single quoted string.
            | [\w\-.:]+      # Non-quoted attrib value can be A-Z0-9-._:
            )                # End of attribute value alternatives.
          )?                 # Attribute value is optional.
        )*                   # Allow zero or more attribute=value pairs
        \s*                  # Whitespace is allowed before closing delimiter.
        >                    # Opening tag closing ">" delimiter.
        \s*                  # Content is zero or more whitespace.
        </\1\s*>             # Element closing tag.
        %x';
    while (preg_match($re, $text)) {
        // Repeatedly remove the innermost empty elements until none remain.
        $text = preg_replace($re, '', $text);
    }
    return $text;
}
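For comparison, here is a rough Python port of the two patterns (Python is not used anywhere in this thread; the names SIMPLE, ENHANCED and strip_empty are made up for this sketch). It only illustrates the difference described above: the simple pattern stops at the first ">" inside a quoted attribute value, while the enhanced one reads past it.

```python
import re

# Simple pattern from the VB.NET snippet: the start tag ends at the first ">".
SIMPLE = re.compile(r"<(\w+)\b[^>]*>\s*</\1\s*>")

# Enhanced pattern, ported from the commented PHP version above.
ENHANCED = re.compile(r"""
    <(\w+)\b                 # Tag name (group 1).
    (?:
      \s+[\w\-.:]+           # Attribute name.
      (?:
        \s*=\s*
        (?: "[^"]*" | '[^']*' | [\w\-.:]+ )   # Quoted or unquoted value.
      )?
    )*
    \s*>                     # End of the start tag.
    \s*                      # Whitespace-only content.
    </\1\s*>                 # Matching end tag.
    """, re.VERBOSE)


def strip_empty(html, pattern):
    """Re-apply the pattern until the text stops changing (hypothetical helper)."""
    while True:
        stripped = pattern.sub("", html)
        if stripped == html:
            return html
        html = stripped


sample = '<div>x<p title="a>b"></p><em></em></div>'
print(strip_empty(sample, SIMPLE))    # the <p> with ">" in its attribute survives
print(strip_empty(sample, ENHANCED))  # both empty elements are removed
```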

Caveats: This function does not parse HTML. It simply matches and removes any text sequence corresponding to a valid empty HTML 4.01 element (which, by definition, is not nested). Note that it also erroneously matches and removes the same text pattern where it occurs outside normal HTML markup, such as within SCRIPT and STYLE elements, HTML comments, and the attributes of other start tags. This regex does not work with short tags. To any bobince fan about to give this answer an automatic down vote: please show me one valid HTML 4.01 empty element that this regex fails to match correctly. This regex follows the W3C spec and really does work.
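To make the first caveat concrete, here is a small Python sketch (Python and the names EMPTY, html and result are assumptions of this example, not part of the answer) that runs the simple pattern over text where the "empty element" sits inside a script string and an HTML comment. A regex has no idea it is outside normal markup, so both get removed.

```python
import re

# Simple empty-element pattern from the VB.NET snippet above.
EMPTY = re.compile(r"<(\w+)\b[^>]*>\s*</\1\s*>")

html = '<script>var tpl = "<b></b>";</script><!-- <i></i> --><p>kept</p>'

# A single pass is enough here; neither match is nested.
result = EMPTY.sub("", html)
print(result)
```

The script's string literal and the comment body are both altered, even though neither was real markup.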

Update: This regex solution also does not work (and will erroneously remove valid markup) if you do something insanely unlikely (but perfectly valid) like this:

<div att="<p att='">stuff</div><div att="'></p>'">stuff</div>
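A quick way to see the damage is to run the enhanced pattern over that exact string. The following is a Python sketch (the language and the variable names are assumptions of this example; the pattern is the PHP one above, written on one line). The input holds two DIV elements, each containing "stuff"; the regex mistakes the text starting at "<p att=" for an empty P element and removes it, taking the first DIV's content with it.

```python
import re

# Enhanced pattern from the PHP version above, uncommented.
ENHANCED = re.compile(
    r"""<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*\s*>\s*</\1\s*>"""
)

html = '''<div att="<p att='">stuff</div><div att="'></p>'">stuff</div>'''

# Re-apply until nothing matches, as the PHP loop does.
result = html
while ENHANCED.search(result):
    result = ENHANCED.sub("", result)

print(result)
```

Only one DIV is left, and its attribute value is no longer the original one.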

Summary:

On second thought, just use an HTML parser!

The problem you face is the arbitrary levels of nesting, which cannot be matched with a standard regex. I suppose you could apply the same regex replacement over and over again until nothing is left. But there are better solutions out there, such as a dedicated HTML parsing library.

You can't do it with a regular expression. You could probably use an XML parser, assuming the HTML is well formed.
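As a rough illustration of the parser route, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It assumes the input is well-formed XML (XHTML-style) with a single root element; the function names and the VOID set are made up for this example. Since an XML parser cannot tell <br/> from <br></br>, the sketch keeps a list of elements that are legitimately empty and never removes them.

```python
import xml.etree.ElementTree as ET

# Assumption: elements that are empty by design and must be kept.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


def is_empty(el):
    """True for a non-void element with no children and only whitespace text."""
    return el.tag not in VOID and len(el) == 0 and not (el.text or "").strip()


def detach(parent, child):
    """Remove child but keep the text that followed it (its tail)."""
    tail = child.tail or ""
    idx = list(parent).index(child)
    if idx == 0:
        parent.text = (parent.text or "") + tail
    else:
        prev = parent[idx - 1]
        prev.tail = (prev.tail or "") + tail
    parent.remove(child)


def prune(parent):
    """Post-order walk, so the innermost empty elements go first."""
    for child in list(parent):
        prune(child)
        if is_empty(child):
            detach(parent, child)


def strip_empty_elements(xhtml):
    root = ET.fromstring(xhtml)   # raises ParseError if not well formed
    prune(root)                   # the root itself is never removed
    return ET.tostring(root, encoding="unicode")


print(strip_empty_elements(
    "<div><p><strong></strong></p>text<br/><span> </span>more</div>"
))
```

Because the children are pruned before the parent is checked, nesting needs no repeated passes. Real-world HTML that is not well formed would need an actual HTML parser instead.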

Why recursive, though? You could simply run

 <(\w+)\s*>\s*</\1\s*>

and replace it with nothing, and keep applying that regular expression until your input doesn't change anymore.
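In Python, that "keep applying until nothing changes" loop could be sketched as follows (the language and the names BARE_EMPTY and remove_until_stable are assumptions of this example). Note that this particular pattern only matches start tags without attributes, so an empty element such as <p class="x"></p> is left alone.

```python
import re

# Attribute-free pattern from the line above.
BARE_EMPTY = re.compile(r"<(\w+)\s*>\s*</\1\s*>")


def remove_until_stable(html):
    """Apply the replacement until a pass makes no substitutions."""
    while True:
        html, count = BARE_EMPTY.subn("", html)
        if count == 0:
            return html


print(remove_until_stable('<p><strong></strong></p><p class="x"></p>ok'))
```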
