Is there an easy way to take a string of html in JavaScript and strip out the html?
For easier solution, try this => https://css-tricks.com/snippets/javascript/strip-html-tags-in-javascript/
var StrippedString = OriginalString.replace(/(<([^>]+)>)/ig,"");
You can safely strip html tags using the iframe sandbox attribute.
The idea here is that instead of trying to regex our string, we take advantage of the browser's native parser by injecting the text into a DOM element and then querying the textContent
/innerText
property of that element.
The best suited element in which to inject our text is a sandboxed iframe, that way we can prevent any arbitrary code execution (Also known as XSS).
The downside of this approach is that it only works in browsers.
Here's what I came up with (Not battle-tested):
const stripHtmlTags = (() => {
const sandbox = document.createElement("iframe");
sandbox.sandbox = "allow-same-origin"; // <--- This is the key
sandbox.style.setProperty("display", "none", "important");
// Inject the sanbox in the current document
document.body.appendChild(sandbox);
// Get the sandbox's context
const sanboxContext = sandbox.contentWindow.document;
return (untrustedString) => {
if (typeof untrustedString !== "string") return "";
// Write the untrusted string in the iframe's body
sanboxContext.open();
sanboxContext.write(untrustedString);
sanboxContext.close();
// Get the string without html
return sanboxContext.body.textContent || sanboxContext.body.innerText || "";
};
})();
Usage (demo):
console.log(stripHtmlTags(`<img onerror='alert("could run arbitrary JS here")' src='bogus'>XSS injection :)`));
console.log(stripHtmlTags(`<script>alert("awdawd");</` + `script>Script tag injection :)`));
console.log(stripHtmlTags(`<strong>I am bold text</strong>`));
console.log(stripHtmlTags(`<html>I'm a HTML tag</html>`));
console.log(stripHtmlTags(`<body>I'm a body tag</body>`));
console.log(stripHtmlTags(`<head>I'm a head tag</head>`));
console.log(stripHtmlTags(null));
Below code allows you to retain some html tags while stripping all others
function strip_tags(input, allowed) {
allowed = (((allowed || '') + '')
.toLowerCase()
.match(/<[a-z][a-z0-9]*>/g) || [])
.join(''); // making sure the allowed arg is a string containing only tags in lowercase (<a><b><c>)
var tags = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi,
commentsAndPhpTags = /<!--[\s\S]*?-->|<\?(?:php)?[\s\S]*?\?>/gi;
return input.replace(commentsAndPhpTags, '')
.replace(tags, function($0, $1) {
return allowed.indexOf('<' + $1.toLowerCase() + '>') > -1 ? $0 : '';
});
}
Using Jquery:
function stripTags() {
return $('<p></p>').html(textToEscape).text()
}
This should do the work on any Javascript environment (NodeJS included).
const text = `
<html lang="en">
<head>
<style type="text/css">*{color:red}</style>
<script>alert('hello')</script>
</head>
<body><b>This is some text</b><br/><body>
</html>`;
// Remove style tags and content
text.replace(/<style[^>]*>.*<\/style>/gm, '')
// Remove script tags and content
.replace(/<script[^>]*>.*<\/script>/gm, '')
// Remove all opening, closing and orphan HTML tags
.replace(/<[^>]+>/gm, '')
// Remove leading spaces and repeated CR/LF
.replace(/([\r\n]+ +)+/gm, '');
I made some modifications to original Jibberboy2000 script Hope it'll be usefull for someone
str = '**ANY HTML CONTENT HERE**';
str=str.replace(/<\s*br\/*>/gi, "\n");
str=str.replace(/<\s*a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
str=str.replace(/<\s*\/*.+?>/ig, "\n");
str=str.replace(/ {2,}/gi, " ");
str=str.replace(/\n+\s*/gi, "\n\n");