How to close unclosed HTML Tags?

后端未结

关注

 8  553

Whenever we are fetching some user inputed content with some editing from the database or similar sources, we might retrieve the portion which only contains the opening tag

相关标签:

8条回答

执笔经年

2020-11-30 02:14

I used to the native DOMDocument method, but with a few improvements for safety.

Note, other answers that use DOMDocument do not consider html strands such as

This is a <em>HTML</em> strand

The above will actually result in

<p>This is a <em>HTML</em> strand

My Solution is below

function closeDanglingTags($html) {
    if (strpos($html, '<') || strpos($html, '>')) {
        // There are definitiley HTML tags
        $wrapped = false;
        if (strpos(trim($html), '<') !== 0) {
            // The HTML starts with a text node. Wrap it in an element with an id to prevent the software wrapping it with a <p>
            //  that we know nothing about and cannot safely retrieve
            $html = cHE::getDivHtml($html, null, 'closedanglingtagswrapper');
            $wrapped = true;
        }
        $doc = new DOMDocument();
        $doc->encoding = 'utf-8';
        @$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
        if ($doc->firstChild) {
            // Test whether the firstchild is definitely a DOMDocumentType
            if ($doc->firstChild instanceof DOMDocumentType) {
                // Remove the added doctype
                $doc->removeChild($doc->firstChild);
            }
        }
        if ($wrapped) {
            // The contents originally started with a text node and was wrapped in a div#plasmappclibtextwrap. Take the contents
            //  out of that div
            $node = $doc->getElementById('closedanglingtagswrapper');
            $children = $node->childNodes;  // The contents of the div. Equivalent to $('selector').children()
            $doc = new DOMDocument();   // Create a new document to add the contents to, equiv. to "var doc = $('<html></html>');"
            foreach ($children as $childnode) {
                $doc->appendChild($doc->importNode($childnode, true)); // E.g. doc.append()
            }
        }
        // Remove the added html,body tags
        return trim(str_replace(array('<html><body>', '</body></html>'), '', html_entity_decode($doc->saveHTML())));
    } else {
        return $html;
    }
}

0 讨论(0)

清歌不尽

2020-11-30 02:18

For HTML fragments, and working from KJS's answer I have had success with the following when the fragment has one root element:

$dom = new DOMDocument();
$dom->loadHTML($string);
$body = $dom->documentElement->firstChild->firstChild;
$string = $dom->saveHTML($body);

Without a root element this is possible (but seems to wrap only the first text child node in p tags in text <p>para</p> text):

$dom = new DOMDocument();
$dom->loadHTML($string);
$bodyChildNodes = $dom->documentElement->firstChild->childNodes;

$string = '';
foreach ($bodyChildNodes as $node){
   $string .= $dom->saveHTML($node);
}

Or better yet, from PHP >= 5.4 and libxml >= 2.7.8 (2.7.7 for LIBXML_HTML_NOIMPLIED):

$dom = new DOMDocument();

// Load with no html/body tags and do not add a default dtd
$dom->loadHTML($string, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$string = $dom->saveHTML();

0 讨论(0)

花落未央

2020-11-30 02:23
Found a great answer for this one:

Use PHP 5 and use the loadHTML() method of the DOMDocument object. This auto parses badly formed HTML and a subsequent call to saveXML() will output the valid HTML. The DOM functions can be found here:

http://www.php.net/dom

The usage of this:
```
$doc = new DOMDocument();
$doc->loadHTML($yourText);
$yourText = $doc->saveHTML();
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
天命终不由人

2020-11-30 02:23

You can use Tidy:

Tidy is a binding for the Tidy HTML clean and repair utility which allows you to not only clean and otherwise manipulate HTML documents, but also traverse the document tree.

or HTMLPurifier

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.

0 讨论(0)
发布评论:

提交评论
- 加载中...

一个人的身影

2020-11-30 02:23

A better PHP function to delete not open/not closed tags from webmaster-glossar.de (me)

function closetag($html){
    $html_new = $html;
    preg_match_all ( "#<([a-z]+)( .*)?(?!/)>#iU", $html, $result1);
    preg_match_all ( "#</([a-z]+)>#iU", $html, $result2);
    $results_start = $result1[1];
    $results_end = $result2[1];
    foreach($results_start AS $startag){
        if(!in_array($startag, $results_end)){
            $html_new = str_replace('<'.$startag.'>', '', $html_new);
        }
    }
    foreach($results_end AS $endtag){
        if(!in_array($endtag, $results_start)){
            $html_new = str_replace('</'.$endtag.'>', '', $html_new);
        }
    }
    return $html_new;
}

use this function like:

closetag('i <b>love</b> my <strike>cat'); 
#output: i <b>love</b> my cat

closetag('i <b>love</b> my cat</strike>'); 
#output: i <b>love</b> my cat

0 讨论(0)

暗喜

2020-11-30 02:32

Erik Arvidsson wrote a nice HTML SAX parser in 2004. http://erik.eae.net/archives/2004/11/20/12.18.31/

It keeps track of the the open tags, so with a minimalistic SAX handler it's possible to insert closing tags at the correct position:

function tidyHTML(html) {
    var output = '';
    HTMLParser(html, {
        comment: function(text) {
            // filter html comments
        },
        chars: function(text) {
            output += text;
        },
        start: function(tagName, attrs, unary) {
            output += '<' + tagName;
            for (var i = 0; i < attrs.length; i++) {
                output += ' ' + attrs[i].name + '=';
                if (attrs[i].value.indexOf('"') === -1) {
                    output += '"' + attrs[i].value + '"';
                } else if (attrs[i].value.indexOf('\'') === -1) {
                    output += '\'' + attrs[i].value + '\'';
                } else { // value contains " and ' so it cannot contain spaces
                    output += attrs[i].value;
                }
            }
            output += '>';
        },
        end: function(tagName) {
            output += '</' + tagName + '>';
        }
    });
    return output;
}

0 讨论(0)

1 2 下一页