When we fetch user-entered content, possibly edited, from a database or a similar source, the portion we retrieve may contain an opening tag without its matching closing tag.
I have a solution for PHP:
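<?php
// Append closing tags for any HTML tags left open in $html.
function closetags($html)
{
    // Void elements never take a closing tag, so they are ignored below.
    $void = array('area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'link', 'meta', 'source', 'track', 'wbr');

    // Put all opened tags into an array (self-closing tags such as <br /> are skipped).
    preg_match_all('#<([a-z][a-z0-9]*)\b[^>]*(?<!/)>#i', $html, $result);
    $openedtags = array_values(array_diff(array_map('strtolower', $result[1]), $void));

    // Put all closed tags into an array.
    preg_match_all('#</([a-z][a-z0-9]*)\s*>#i', $html, $result);
    $closedtags = array_map('strtolower', $result[1]);

    $len_opened = count($openedtags);

    // All tags are closed.
    if (count($closedtags) == $len_opened) {
        return $html;
    }

    $openedtags = array_reverse($openedtags);

    // Close tags, most recently opened first.
    for ($i = 0; $i < $len_opened; $i++) {
        $pos = array_search($openedtags[$i], $closedtags);
        if ($pos === false) {
            $html .= '</' . $openedtags[$i] . '>';
        } else {
            // This opening tag already has a closing partner; use it up.
            unset($closedtags[$pos]);
        }
    }

    return $html;
}
?>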
You can use this function like this:
<?php echo closetags("your content <p>test test"); ?>
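One thing to be aware of: the function pairs opening and closing tags by name only, not by position, so for input like `<i>a</i><b><i>c` it can append the closers in the wrong order. Tracking open tags on a stack avoids that. Here is a minimal sketch of the stack idea, written in JavaScript (the language of the browser-side snippet further down) so it can be tried outside PHP. The names `closeTagsInOrder` and `VOID_TAGS` are made up for this example, and the regex is a rough tag matcher, not an HTML parser: it will be confused by a `>` inside an attribute value.

```javascript
// Void elements never take a closing tag, so they are never pushed on the stack.
var VOID_TAGS = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                 'input', 'link', 'meta', 'source', 'track', 'wbr'];

// Hypothetical helper: appends closing tags for whatever is still open at the
// end of `html`, innermost first. It only appends; it never rewrites the input.
function closeTagsInOrder(html) {
  var stack = [];
  // group 1: "/" for a closing tag, group 2: tag name, group 3: "/" if self-closed
  var tagPattern = /<(\/?)([a-z][a-z0-9]*)\b[^>]*?(\/?)>/gi;
  var match;
  while ((match = tagPattern.exec(html)) !== null) {
    var isClosing = match[1] === '/';
    var name = match[2].toLowerCase();
    var selfClosed = match[3] === '/';
    if (isClosing) {
      // Drop the matching open tag and anything opened inside it.
      var idx = stack.lastIndexOf(name);
      if (idx !== -1) {
        stack.length = idx;
      }
    } else if (!selfClosed && VOID_TAGS.indexOf(name) === -1) {
      stack.push(name);
    }
  }
  var out = html;
  while (stack.length > 0) {
    out += '</' + stack.pop() + '>';
  }
  return out;
}
```

With this sketch, `closeTagsInOrder('<i>a</i><b><i>c')` gives `<i>a</i><b><i>c</i></b>`. The same loop should port to PHP with `preg_match_all` and an array used as the stack.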
In addition to server-side tools like Tidy, you can also use the user's browser to do some of the cleanup for you. One of the really great things about innerHTML is that it will apply the same on-the-fly repair to dynamic content as it does to HTML pages. This code works pretty well (with two caveats) and nothing actually gets written to the page:
var divTemp = document.createElement('div');
divTemp.innerHTML = '<p id="myPara">these <i>tags aren\'t <strong> closed';
console.log(divTemp.innerHTML);
The caveats:
Different browsers will return different strings. This isn't so bad, except in the case of IE, which returns capitalized tag names and strips the quotes from attribute values, so its output will not pass validation. The solution here is to do some simple clean-up on the server side. But at least the tags will be properly closed and nested.
I suspect that you may have to put in a delay before reading the innerHTML -- give the browser a chance to digest the string -- or you risk getting back exactly what was put in. I just tried on IE8 and it looks like the string gets parsed immediately, but I'm not so sure on IE6. It would probably be best to read the innerHTML after a delay (or throw it into a setTimeout() to force it to the end of the queue).
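Both caveats can be handled in a few lines. What follows is only a sketch under some assumptions: `lowercaseTagNames` and `readRepairedHtml` are names invented here, the optional `doc` parameter exists only so the helper can be exercised outside a browser, and the tag-name regex is deliberately crude (it does not re-quote attributes, and it would also touch a `<` followed by letters inside an attribute value). Whether the delay is needed at all is still the open question from the second caveat.

```javascript
// Caveat 1 (partial): lower-case tag names such as <P> or </STRONG>.
// Attribute quoting is NOT handled here; that still needs a real clean-up pass.
function lowercaseTagNames(html) {
  return html.replace(/<(\/?)([A-Za-z][A-Za-z0-9]*)/g, function (whole, slash, name) {
    return '<' + slash + name.toLowerCase();
  });
}

// Caveat 2: read innerHTML on a later turn of the event loop, not immediately.
// `doc` is an assumption added for this sketch; it defaults to the page's document.
function readRepairedHtml(html, callback, doc) {
  var divTemp = (doc || document).createElement('div'); // never attached to the page
  divTemp.innerHTML = html;
  setTimeout(function () {
    // By now the browser has had a chance to parse and repair the string.
    callback(lowercaseTagNames(divTemp.innerHTML));
  }, 0);
}
```

In a page you would call it as `readRepairedHtml('<p id="myPara">these <i>tags aren\'t <strong> closed', function (fixed) { console.log(fixed); });` and then send `fixed` to the server for the rest of the clean-up.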
I would recommend you take @Gordon's advice and use Tidy if you have access to it (it takes less work to implement) and failing that, use innerHTML and write your own tidy function in PHP.
And though this isn't part of your question, as this is for a CMS, consider also using the YUI 2 Rich Text Editor for stuff like this. It's fairly easy to implement, somewhat easy to customize, the interface is very familiar to most users, and it spits out perfectly valid code. There are several other off-the-shelf rich text editors out there, but YUI has the best license and is the most powerful I've seen.