Jeff posted about this in Sanitize HTML, but his example is in C# and I'm more interested in a Java version. Does anyone have a better version for Java?
For Java, I used the following regular expression with replaceAll, and it worked for me:
value.replaceAll("(?i)(\\b)(on\\S+)(\\s*)=|javascript:|(<\\s*)(\\/*)script|style(\\s*)=|(<\\s*)meta", "");
I added (?i) to make the match case-insensitive.
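A minimal sketch of that replaceAll call wrapped in a helper, to show what the pattern actually removes. The class name XssStrip and the method name strip are made up for illustration; only the regular expression comes from the answer above. Treat it as a demonstration of the pattern's behaviour, not as a complete defence.

```java
class XssStrip {
    // The pattern from the answer above, unchanged. (?i) makes it case-insensitive.
    private static final String PATTERN =
        "(?i)(\\b)(on\\S+)(\\s*)=|javascript:|(<\\s*)(\\/*)script|style(\\s*)=|(<\\s*)meta";

    // Hypothetical helper: deletes every match of the blacklist pattern.
    static String strip(String value) {
        return value.replaceAll(PATTERN, "");
    }

    public static void main(String[] args) {
        System.out.println(strip("<script>alert(1)</script>"));          // >alert(1)>
        System.out.println(strip("<img src=x onerror=alert(1)>"));       // <img src=x alert(1)>
        System.out.println(strip("<a href=\"JavaScript:alert(1)\">x</a>")); // <a href="alert(1)">x</a>
    }
}
```

Note that the pattern only deletes the matched fragment (for example the "<script" prefix), so leftover characters such as ">" stay in the output.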
Match the whole input against
[\s\w\.]*
If it doesn't match, you've got XSS. Maybe. Note that this expression only allows letters, digits, underscores, whitespace, and periods. It avoids all other symbols, even useful ones, out of fear of XSS. Once you allow &, you've got worries, and merely replacing all instances of & with &amp; is not sufficient. Too complicated to trust :P. Obviously this will disallow a lot of legitimate text (you can just replace all non-matching characters with a ! or something), but I think it will kill XSS.
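The whitelist idea above could be sketched in Java roughly as follows. The class and method names (WhitelistCheck, isSafe, neutralize) are assumptions for illustration; only the character class comes from the answer.

```java
class WhitelistCheck {
    // Whitelist from the answer: whitespace, word characters, and periods only.
    private static final java.util.regex.Pattern SAFE =
        java.util.regex.Pattern.compile("[\\s\\w.]*");

    // True only if every character of the input is on the whitelist.
    static boolean isSafe(String value) {
        return SAFE.matcher(value).matches();
    }

    // Replace each character outside the whitelist with '!'.
    static String neutralize(String value) {
        return value.replaceAll("[^\\s\\w.]", "!");
    }

    public static void main(String[] args) {
        System.out.println(isSafe("Hello world. 42"));  // true
        System.out.println(isSafe("<script>"));         // false
        System.out.println(neutralize("<b>hi</b>"));    // !b!hi!!b!
    }
}
```

In Java, \w is ASCII-only by default, so accented letters are rejected as well unless the pattern is compiled with Pattern.UNICODE_CHARACTER_CLASS.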
The idea of just parsing it as HTML and generating new HTML is probably better.
The biggest problem with using Jeff's code is the @ (the C# verbatim string prefix), which isn't currently available in Java.
I would probably just take the "raw" regexp from Jeff's code if I needed it and paste it into
http://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html
to see the characters that need escaping get escaped, and then use the result.
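If you would rather do that escaping step locally, a rough sketch of it is below. The class RegexLiteral and the method toJavaLiteral are hypothetical names, and the sketch assumes the only characters that need attention are backslashes and double quotes, which is the usual case for a single-line regex.

```java
class RegexLiteral {
    // Turn a raw regex into the text of a Java string literal:
    // double every backslash, escape every double quote, then wrap in quotes.
    static String toJavaLiteral(String rawRegex) {
        String escaped = rawRegex
            .replace("\\", "\\\\")   // \  becomes \\
            .replace("\"", "\\\"");  // "  becomes \"
        return "\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        // The raw regex \s*= is printed as "\\s*=" (including the quotes).
        System.out.println(toJavaLiteral("\\s*="));
    }
}
```

The printed text can be pasted into Java source as the pattern argument of replaceAll or Pattern.compile.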
Given how this regex will be used, I would personally make sure I understood exactly what I was doing, why, and what the consequences would be if I didn't succeed, before copy/pasting anything, including what the other answers offer.
(That's probably pretty sound advice for any copy/paste.)