I have written some code that takes a string of HTML and cleans away any ugly markup from it using jQuery (see an early prototype in this SO question). It works pretty well, but I
You should remove the script elements:
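// Caveat: .remove("script") returns the original set, so the script elements are still passed to .append().
var wrapper = $('<div/>').append($(html).remove("script"));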
Second attempt:
node-validator can be used in the browser: https://github.com/chriso/node-validator
var str = sanitize(large_input_str).xss();
Alternatively, PHPJS has a strip_tags function (regex/evil based): http://phpjs.org/functions/strip_tags:535
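To make the regex-based idea concrete, here is a minimal sketch of a tag stripper. The name `stripTags` and its three-pass structure are assumptions made for illustration; this is not the PHPJS `strip_tags` source, and it removes all tags rather than accepting a list of allowed ones.

```javascript
// Hypothetical helper for illustration only; NOT the PHPJS strip_tags implementation.
function stripTags(input) {
  return String(input)
    // 1. Drop script and style elements together with their contents,
    //    so their source text does not leak into the output.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, '')
    // 2. Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, '')
    // 3. Drop any remaining opening, closing, or self-closing tags.
    .replace(/<\/?[a-z][^>]*>/gi, '');
}

console.log(stripTags('<p>Hello <b>world</b></p><script>alert(1)</script>'));
```

This is fine for tidying up trusted markup, but it should not be relied on as a security filter: a single regex pass can be defeated by crafted input (for example, `<<b>script>` collapses to `<script>` after the inner tag is removed), which is one reason the DOM-based approaches in the other answers are usually preferable.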
How about removing the scripts first?
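// .not('script') filters only the top-level elements of the parsed set; nested script elements remain.
var wrapper = $('<div/>').append($(html).not('script'));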
Assuming script elements in the html are not nested in other elements:
var wrapper = document.createElement('div');
wrapper.innerHTML = html;
$(wrapper).children().remove('script');
If the script elements may be nested inside other elements, use find instead:
var wrapper = document.createElement('div');
wrapper.innerHTML = html;
$(wrapper).find('script').remove();
This works for the case where html is just text and where html has text outside any elements.