In my code I have a parent DOM element docElem. This is an iframe containing a complete HTML document. Now I want to remove all inline JavaScript.
You can use the html-sanitizer from the Google Caja project. It can be used stand-alone in the browser.
You can get it from:
http://caja.appspot.com/html-css-sanitizer-minified.js
or:
http://caja.appspot.com/html-sanitizer-minified.js
(depending on whether you need to sanitize CSS as well)
You have to define two functions to tell the sanitizer how you want it to treat URLs and element IDs (I'll name them sanUrl() and sanId() here).
For example, you may want to remove IDs completely so that they don't interfere with your own IDs:
function sanId(id) {
    return undefined;
}
or you may want to add a prefix:
function sanId(id) {
    return "PREFIX" + id;
}
or just keep them unchanged if that's OK for you:
function sanId(id) {
    return id;
}
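If you go with the prefix approach, it can help to build the callback from a small factory, so that the prefix lives in one place and odd-looking IDs are dropped instead of passed through. This is only a sketch: makeIdPolicy and the "frame1-" prefix are names I made up for illustration, and the pattern for what counts as an acceptable ID is an assumption you should adapt.

```javascript
// Hypothetical helper (not part of Caja): builds a sanId-style callback for a prefix.
function makeIdPolicy(prefix) {
    // Assumption: only simple tokens (a letter, then letters, digits, "_", "-", ":", ".")
    // are worth keeping.
    var safeToken = /^[A-Za-z][\w\-:.]*$/;
    return function sanId(id) {
        if (typeof id !== 'string' || !safeToken.test(id)) {
            return undefined; // drop anything unexpected
        }
        return prefix + id;
    };
}

var sanId = makeIdPolicy('frame1-');
```

With this, sanId('header') gives 'frame1-header', while something like 'bad id' is removed.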
The same goes for URLs:
function sanUrl(url) {
    // sanitize URLs if needed,
    // e.g. add a prefix or remove relative/absolute URLs
    return url;
}
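As an illustration of a stricter policy, here is a sketch of a sanUrl() that keeps only absolute http(s) URLs. It assumes that the global URL constructor is available (modern browsers, Node.js) and that returning undefined makes the sanitizer drop the URL, as it does for IDs; please verify both against the version you use.

```javascript
// Sketch of a stricter sanUrl: keep absolute http(s) URLs, drop everything else.
function sanUrl(url) {
    var parsed;
    try {
        // String() is there in case the value passed in is not a plain string.
        parsed = new URL(String(url));
    } catch (e) {
        return undefined; // relative or malformed URL
    }
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
        return undefined; // e.g. javascript:, data:, file:
    }
    return parsed.href;
}
```

Whether you want to reject relative URLs like this depends on your content. If the document relies on them, you would resolve them against a base URL instead of dropping them.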
Now you can use the html_sanitize() function like this:
var sanitizedHtml = html_sanitize(originalHtml, sanUrl, sanId);
It will strip much more than what you described, which means that you won't get into trouble with input that you haven't anticipated.
It will also strip the html, head and body tags, so if you need them you can add them back:
var fullHtml = "<html><body>" + sanitizedHtml + "</body></html>";
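Putting the steps together for the iframe from the question could look roughly like the sketch below. sanitizeFrame is a made-up helper name, and it assumes that the iframe is same-origin (otherwise contentDocument is not readable) and that a minimal html/body wrapper is enough for you. The sanitizer is passed in as a parameter so that the helper does not depend on a global.

```javascript
// Hypothetical helper: returns sanitized HTML for the document inside an iframe.
// "sanitize" is expected to be html_sanitize; sanUrl and sanId are your two callbacks.
function sanitizeFrame(frame, sanitize, sanUrl, sanId) {
    // contentDocument is only readable for same-origin frames
    var doc = frame.contentDocument;
    if (!doc || !doc.documentElement) {
        return null;
    }
    var originalHtml = doc.documentElement.outerHTML;
    var sanitizedHtml = sanitize(originalHtml, sanUrl, sanId);
    // Re-add wrapper tags, since the sanitizer strips html, head and body
    return '<html><body>' + sanitizedHtml + '</body></html>';
}

// Usage in the browser, with docElem being the iframe element:
// var fullHtml = sanitizeFrame(docElem, html_sanitize, sanUrl, sanId);
```

The helper returns null when the frame's document cannot be read, so check for that before using the result.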
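You can also, for example, get the image URLs with code like this:
// Parse into a detached wrapper so that top-level and nested images are both found
// and input that starts with plain text is not mistaken for a selector
$('<div>').html(sanitizedHtml).find('img').each(function (i, el) {
    var url = $(el).attr('src');
    // do something with the URL:
    alert(url);
});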
See this demo:
http://codepen.io/rsp/pen/hLmcE