When we fetch user-entered content from a database or a similar source, possibly after some editing or truncation, we may retrieve a portion that contains only the opening tag.
I used the native DOMDocument approach, but with a few improvements for safety.
Note that other answers that use DOMDocument do not consider HTML strands such as:
This is a <em>HTML</em> strand
With those answers, the above will actually result in:
<p>This is a <em>HTML</em> strand
My solution is below:
function closeDanglingTags($html) {
if (strpos($html, '<') !== false || strpos($html, '>') !== false) {
// There are definitely HTML tags (compare against false: strpos() returns 0 for a match at the very start)
$wrapped = false;
if (strpos(trim($html), '<') !== 0) {
// The HTML starts with a text node. Wrap it in an element with an id to prevent the software wrapping it with a <p>
// that we know nothing about and cannot safely retrieve
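// cHE::getDivHtml() is the author's own helper (not shown); plain concatenation is assumed to be equivalent here
$html = '<div id="closedanglingtagswrapper">' . $html . '</div>';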
$wrapped = true;
}
$doc = new DOMDocument();
$doc->encoding = 'utf-8';
@$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
if ($doc->firstChild) {
// Test whether the firstchild is definitely a DOMDocumentType
if ($doc->firstChild instanceof DOMDocumentType) {
// Remove the added doctype
$doc->removeChild($doc->firstChild);
}
}
if ($wrapped) {
// The contents originally started with a text node and were wrapped in div#closedanglingtagswrapper. Take the contents
// out of that div
$node = $doc->getElementById('closedanglingtagswrapper');
$children = $node->childNodes; // The contents of the div. Equivalent to $('selector').children()
$doc = new DOMDocument(); // Create a new document to add the contents to, equiv. to "var doc = $('<html></html>');"
foreach ($children as $childnode) {
$doc->appendChild($doc->importNode($childnode, true)); // E.g. doc.append()
}
}
// Remove the added html,body tags
return trim(str_replace(array('<html><body>', '</body></html>'), '', html_entity_decode($doc->saveHTML())));
} else {
return $html;
}
}
For HTML fragments, and working from KJS's answer, I have had success with the following when the fragment has one root element:
$dom = new DOMDocument();
$dom->loadHTML($string);
$body = $dom->documentElement->firstChild->firstChild;
$string = $dom->saveHTML($body);
Without a root element this is still possible (although, given input such as text <p>para</p> text, it seems to wrap only the first text child node in p tags):
$dom = new DOMDocument();
$dom->loadHTML($string);
$bodyChildNodes = $dom->documentElement->firstChild->childNodes;
$string = '';
foreach ($bodyChildNodes as $node){
$string .= $dom->saveHTML($node);
}
Or better yet, from PHP >= 5.4 and libxml >= 2.7.8 (2.7.7 for LIBXML_HTML_NOIMPLIED):
$dom = new DOMDocument();
// Load with no html/body tags and do not add a default dtd
$dom->loadHTML($string, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$string = $dom->saveHTML();
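To make the flag-based variant concrete, here is a minimal sketch of it wrapped in a function. The function name closeTagsInFragment is made up for illustration; the DOMDocument calls, the libxml_* functions and the LIBXML_* flags are the real ones. It assumes the fragment has a single root element, since LIBXML_HTML_NOIMPLIED tends to misbehave otherwise, and it assumes ASCII input, since loadHTML() does not treat input as UTF-8 unless told to.

```php
<?php
// Hypothetical wrapper (the name is an assumption, not part of any library).
// Parses a single-rooted HTML fragment and serializes it back, letting
// libxml close whatever was left open at the end of the input.
function closeTagsInFragment(string $html): string
{
    // Collect parser warnings internally instead of emitting PHP warnings
    // for malformed markup; remember the previous setting to restore it.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // No implied <html>/<body> wrapper and no default doctype.
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // saveHTML() appends a trailing newline, so trim it.
    return trim($dom->saveHTML());
}

echo closeTagsInFragment('<div><p>Hello <b>world'), "\n";
```

The serialized result should contain the closing tags for b, p and div in that order. Using libxml_use_internal_errors() rather than the @ operator keeps unrelated errors visible.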
Found a great answer for this one:
Use PHP 5 and use the loadHTML() method of the DOMDocument object. This automatically parses badly formed HTML, and a subsequent call to saveHTML() (or saveXML() for XML-style output) will output the valid HTML. The DOM functions can be found here:
http://www.php.net/dom
Usage:
$doc = new DOMDocument();
$doc->loadHTML($yourText);
$yourText = $doc->saveHTML();
You can use Tidy:
Tidy is a binding for the Tidy HTML clean and repair utility which allows you to not only clean and otherwise manipulate HTML documents, but also traverse the document tree.
or HTML Purifier:
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.
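For the Tidy route, a minimal sketch might look like the following. It assumes the tidy extension is installed, and the wrapper name repairFragmentWithTidy is made up for illustration. tidy_repair_string() and the show-body-only and wrap options are real Tidy names, but the exact whitespace of the output depends on the Tidy version, so treat the result as "repaired markup" rather than a byte-exact string.

```php
<?php
// Hypothetical wrapper (the name is an assumption). Returns the repaired
// fragment, or null when the tidy extension is missing or the repair fails.
function repairFragmentWithTidy(string $html): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not loaded
    }

    $config = [
        'show-body-only' => true, // return only the body content, no <html>/<head> wrapper
        'wrap'           => 0,    // do not re-wrap long lines
    ];

    $repaired = tidy_repair_string($html, $config, 'utf8');

    return $repaired === false ? null : trim($repaired);
}

var_dump(repairFragmentWithTidy('<p>Hello <b>world'));
```

Tidy only repairs structure. If the content comes from untrusted users, it still needs sanitizing, which is what HTML Purifier is for.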
A better PHP function for removing tags that were never opened or never closed, from webmaster-glossar.de (my own site):
function closetag($html){
$html_new = $html;
preg_match_all ( "#<([a-z]+)( .*)?(?!/)>#iU", $html, $result1);
preg_match_all ( "#</([a-z]+)>#iU", $html, $result2);
$results_start = $result1[1];
$results_end = $result2[1];
foreach($results_start AS $startag){
if(!in_array($startag, $results_end)){
$html_new = str_replace('<'.$startag.'>', '', $html_new);
}
}
foreach($results_end AS $endtag){
if(!in_array($endtag, $results_start)){
$html_new = str_replace('</'.$endtag.'>', '', $html_new);
}
}
return $html_new;
}
Use this function like this:
closetag('i <b>love</b> my <strike>cat');
#output: i <b>love</b> my cat
closetag('i <b>love</b> my cat</strike>');
#output: i <b>love</b> my cat
Erik Arvidsson wrote a nice HTML SAX parser in 2004. http://erik.eae.net/archives/2004/11/20/12.18.31/
It keeps track of the open tags, so with a minimalistic SAX handler it's possible to insert closing tags at the correct position:
function tidyHTML(html) {
var output = '';
HTMLParser(html, {
comment: function(text) {
// drop HTML comments (they are not copied to the output)
},
chars: function(text) {
output += text;
},
start: function(tagName, attrs, unary) {
output += '<' + tagName;
for (var i = 0; i < attrs.length; i++) {
output += ' ' + attrs[i].name + '=';
if (attrs[i].value.indexOf('"') === -1) {
output += '"' + attrs[i].value + '"';
} else if (attrs[i].value.indexOf('\'') === -1) {
output += '\'' + attrs[i].value + '\'';
} else { // value contains " and ' so it cannot contain spaces
output += attrs[i].value;
}
}
output += '>';
},
end: function(tagName) {
output += '</' + tagName + '>';
}
});
return output;
}
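Since the question is about PHP, the same open-tag bookkeeping can be sketched there without a parser library. This is only an illustration under stated assumptions, not a port of Erik Arvidsson's parser. The function name closeOpenTags is made up, tags are found with a simple regex (so comments, scripts, or attribute values containing > will confuse it), and missing closing tags are appended at the end of the string rather than at the position a real parser would choose.

```php
<?php
// Hypothetical helper (the name is an assumption, not part of any library).
// Tracks open elements on a stack and appends the closing tags that are
// still missing when the input ends.
function closeOpenTags(string $html): string
{
    // Void elements never take a closing tag.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];
    $stack = [];

    // Group 1 is "/" for closing tags, group 2 the tag name,
    // group 3 the attribute text (ending in "/" for self-closed tags).
    preg_match_all('#<(/?)([a-z][a-z0-9]*)\b([^>]*)>#i', $html, $matches, PREG_SET_ORDER);

    foreach ($matches as $m) {
        $name = strtolower($m[2]);

        if (in_array($name, $void, true) || substr($m[3], -1) === '/') {
            continue; // void or self-closed: nothing to track
        }

        if ($m[1] === '') {
            $stack[] = $name; // opening tag
            continue;
        }

        // Closing tag: pop back to the most recent matching opener.
        // A closer with no matching opener is simply ignored.
        for ($i = count($stack) - 1; $i >= 0; $i--) {
            if ($stack[$i] === $name) {
                array_splice($stack, $i); // drop it and anything opened after it
                break;
            }
        }
    }

    // Close whatever is still open, innermost first.
    $closers = '';
    while ($stack) {
        $closers .= '</' . array_pop($stack) . '>';
    }

    return $html . $closers;
}

echo closeOpenTags('i <b>love</b> my <strike>cat'), "\n";
// prints: i <b>love</b> my <strike>cat</strike>
```

Unlike a function that deletes unmatched tags, this keeps the opening tag and adds its partner, so the formatting the user intended survives.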