I'm using the FreeTextBox editor to get some HTML created by users. The problem is that this editor does not convert special characters to HTML entities, with the exception of "<
If you've got a mixture of < meaning start a tag and < meaning a literal less-than sign, you can't possibly tell which is ‘a tag’ to ignore and which isn't.
About all you could do would be to detect < usages that weren't a conventionally-formed start or end tag, using a nasty unreliable regex something like:
<(?!\w+(\s+\w+="[^"<]*")*\s*/?>|/\w+\s*>)
and replace them with &lt;. Similarly for & with &amp;:
&(?!\w+;|#\d+;|#x[0-9A-Fa-f]+;)
(> does not normally have to be escaped.)
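To make the two replacements concrete, here is a minimal sketch in Python (the question doesn't say which language is in use, so that choice and the function name escape_stray_markup are assumptions). The patterns are the two above, unchanged, so all the same caveats apply.

```python
import re

# The two patterns from the answer, unchanged.
STRAY_LT = re.compile(r'<(?!\w+(\s+\w+="[^"<]*")*\s*/?>|/\w+\s*>)')
STRAY_AMP = re.compile(r'&(?!\w+;|#\d+;|#x[0-9A-Fa-f]+;)')


def escape_stray_markup(html: str) -> str:
    """Escape each & and < that does not look like part of an entity or tag."""
    # Ampersands are handled first here, but the order is not critical:
    # an inserted "&lt;" already looks like an entity to STRAY_AMP.
    html = STRAY_AMP.sub('&amp;', html)
    return STRAY_LT.sub('&lt;', html)


if __name__ == "__main__":
    print(escape_stray_markup('a < b & c <b>bold</b> &amp; 1<2'))
    # Single-quoted attributes are not "conventionally formed" for this
    # pattern, so the tag below gets escaped: one of the known limitations.
    print(escape_stray_markup("<a href='x'>link</a>"))
```

Feeding it a single-quoted attribute or a comment shows how quickly the approach falls over, which is rather the point.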
This won't allow every possible valid way of constructing elements, and it will allow broken mis-nested elements, and non-existent entities, and would mess up non-element constructs like comments. Because regex can't parse HTML, let alone HTML with added crunchy broken bits.
So it's hardly foolproof. If you want proper markup that won't break your page when they accidentally leave a div open, the best first step is to parse it as XHTML and refuse it with an error if it's not well-formed XML.
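A well-formedness check along those lines might look like the sketch below, using Python's standard xml.etree.ElementTree. The function name and the wrapping div are assumptions for illustration: the wrapper is only there because an XML document needs a single root element, and user input is usually a fragment.

```python
import xml.etree.ElementTree as ET


def is_well_formed_fragment(fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in one root element."""
    try:
        # Wrap the input so a run of sibling elements or bare text still
        # has the single root that XML requires.
        ET.fromstring(f"<div>{fragment}</div>")
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed_fragment("<p>fine</p>"))         # True
    print(is_well_formed_fragment("<div><p>left open"))   # False
    print(is_well_formed_fragment("<b><i>x</b></i>"))     # False
```

One thing to watch: a plain XML parser only knows the five built-in entities, so HTML-only ones such as &nbsp; are rejected as undefined unless you declare them or convert them to numeric references first.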
If you have a rich text editor component that generates output where a literal < is not escaped, then it's time to replace that component with something less appalling. But in general it's not a good idea to let users create HTML, because they're really rubbish at it. Plus allowing anyone to input HTML gives them complete control over wrecking the site and its security with JavaScript. A simpler text-markup language is often a win.
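As a rough idea of what a simpler text-markup language can mean, here is a toy sketch in Python. The markup rules are invented for this example (blank lines separate paragraphs, *text* becomes emphasis) and the function name is hypothetical. What matters is that everything the user typed is escaped first, so the only tags in the output are the ones the renderer adds itself.

```python
import html
import re


def render_plain_markup(text: str) -> str:
    """Render a tiny made-up markup: blank lines split paragraphs, *text* is emphasis."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        # Escape everything the user typed (&, <, > and quotes) first.
        escaped = html.escape(block)
        # Then add the only markup this toy language supports.
        escaped = re.sub(r"\*([^*\n]+)\*", r"<em>\1</em>", escaped)
        paragraphs.append(f"<p>{escaped}</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    print(render_plain_markup("1 < 2 & *really*\n\n<script>alert(1)</script>"))
```

In practice you would more likely pick an existing lightweight markup format than invent one, but the escape-first, add-tags-second order is the part worth keeping.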