I would like to process my user input so that only certain HTML tags are allowed. All other tags should be replaced by their HTML entities, and so should special characters outside of tags. For example,
Apply htmlspecialchars() to the whole string, and then, for a given array of allowed tags, replace the encoded tags with the literal tags again:
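function allow_only($str, $allowed){
    // Encode everything first: < becomes &lt;, > becomes &gt;, " becomes &quot;
    $str = htmlspecialchars($str);
    foreach( $allowed as $a ){
        // Turn the encoded opening and closing tags back into real tags
        $str = str_replace("&lt;".$a."&gt;", "<".$a.">", $str);
        $str = str_replace("&lt;/".$a."&gt;", "</".$a.">", $str);
    }
    return $str;
}
echo allow_only("This is <b>bold</b> and this is <i>italic</i>.", array("b"));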
That works for simple tags, returning "This is <b>bold</b> and this is &lt;i&gt;italic&lt;/i&gt;."
As was pointed out, that doesn't work for tags with attributes, but this does:
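function fix_attributes($match){
    // $match[1] is the tag name, $match[2] the still-encoded attribute string;
    // only &quot; is decoded, every other entity stays encoded
    return "<".$match[1].str_replace('&quot;','"',$match[2]).">";
}
function allow_only($str, $allowed){
    $str = htmlspecialchars($str);
    foreach( $allowed as $a ){
        // \b keeps an allowed "b" from also matching tags such as <br> or <body>
        $str = preg_replace_callback("/&lt;(".preg_quote($a, "/").")\b([\s\/\.\w=&;:#]*?)&gt;/", 'fix_attributes', $str);
        $str = str_replace("&lt;/".$a."&gt;", "</".$a.">", $str);
    }
    return $str;
}
echo allow_only('This is <b>bold</b> and <a href="http://www.#links">this</a> is <i>italic</i>.', array("b","a"));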
That handles more complex tags with certain attributes. Only the characters listed between [] in the pattern are allowed to appear in attributes. Unfortunately, &quot; must be allowed within attributes or it won't work, and with it all other entities are allowed too. However, only &quot; in attributes will be decoded.
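To make the effect of the character class more concrete, here is a small sketch. The two sample inputs are made up for illustration, and the pattern is only modeled on the one in allow_only(), fixed to the tag "a". It shows what htmlspecialchars() produces for a tag with an attribute, and that a tag is only matched when every character of its attribute part is in the list:

```php
<?php
// Hypothetical sample inputs, not taken from the original post.
$accepted = '<a href="http://www.#links">this</a>';
$rejected = '<a href="javascript:alert(1)">this</a>';

// Pattern modeled on the one in allow_only(), with the tag name fixed to "a".
$pattern = '/&lt;(a)\b([\s\/\.\w=&;:#]*?)&gt;/';

// The quotes around the attribute value come back as &quot;,
// which is why & and ; have to be in the character class.
$encoded = htmlspecialchars($accepted);
echo $encoded, "\n";
// prints: &lt;a href=&quot;http://www.#links&quot;&gt;this&lt;/a&gt;

var_dump(preg_match($pattern, $encoded));                    // int(1)
var_dump(preg_match($pattern, htmlspecialchars($rejected))); // int(0): "(" and ")" are not listed
```

A tag that is not matched simply stays encoded, so it is displayed as text instead of being interpreted by the browser.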
As was suggested, a much better (safer, cleaner) way to solve problems like this is to use a library such as HTML Purifier: http://htmlpurifier.org/demo.php
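For reference, a rough sketch of what that could look like with HTML Purifier. It assumes the library was installed with Composer (package ezyang/htmlpurifier); the helper name purify_allow_only() is made up for this example. As far as I know, the Core.EscapeInvalidTags setting makes the library write disallowed tags back as text instead of silently dropping them, which is closer to what the question asks for, but check the configuration docs for your version:

```php
<?php
// Assumption: HTML Purifier was installed with Composer (ezyang/htmlpurifier),
// so its classes are available through vendor/autoload.php.
if (is_file(__DIR__ . '/vendor/autoload.php')) {
    require_once __DIR__ . '/vendor/autoload.php';
}

// Hypothetical helper; $allowed uses the HTML.Allowed syntax,
// e.g. 'b,a[href]' allows <b> and <a> with only the href attribute.
function purify_allow_only(string $dirty, string $allowed): string {
    $config = HTMLPurifier_Config::createDefault();
    $config->set('HTML.Allowed', $allowed);
    // Escape disallowed tags instead of removing them
    $config->set('Core.EscapeInvalidTags', true);
    $purifier = new HTMLPurifier($config);
    return $purifier->purify($dirty);
}

// Only runs when the library is actually available.
if (class_exists('HTMLPurifier')) {
    echo purify_allow_only(
        'This is <b>bold</b> and <a href="http://www.#links">this</a> is <i>italic</i>.',
        'b,a[href]'
    ), "\n";
}
```

Unlike the regex version, this parses the markup and checks attributes against a whitelist, so there is no need to decide by hand which characters are safe inside a tag.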