Is there an easy way to take a string of HTML in JavaScript and strip out the HTML?
I altered Jibberboy2000's answer to handle several <br> tag formats, remove everything inside <script> and <style> tags, tidy the resulting text by collapsing repeated line breaks and spaces, and convert some HTML entities back into normal characters. After some testing, it appears that most full web pages can be converted into simple text in which the page title and content are retained.
In the simple example,
<title>This is my title</title>
This string has html code i want to remove
In this line <a href="http://www.bbc.co.uk">BBC</a> with link is mentioned.
Now back to &quot;normal text&quot; and stuff using &lt;html encoding&gt;
becomes
This is my title
This string has html code i want to remove
In this line BBC (http://www.bbc.co.uk) with link is mentioned.
Now back to "normal text" and stuff using <html encoding>
The JavaScript function and test page look like this:
function convertHtmlToText() {
    var inputText = document.getElementById("input").value;
    var returnText = "" + inputText;

    //-- replace BR tags (<br>, <br/>, <br />) with a line break
    returnText = returnText.replace(/<br\s*\/?>/gi, "\n");

    //-- remove P and A tags but preserve what's inside of them
    returnText = returnText.replace(/<p\b[^>]*>/gi, "\n");
    returnText = returnText.replace(/<a\b[^>]*href="(.*?)"[^>]*>(.*?)<\/a>/gi, " $2 ($1)");

    //-- remove everything inside SCRIPT and STYLE tags
    returnText = returnText.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, "");
    returnText = returnText.replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, "");

    //-- remove all remaining tags
    returnText = returnText.replace(/<[^>]*>/g, "");

    //-- collapse runs of two or more line breaks into a single blank line
    returnText = returnText.replace(/(?:(?:\r\n|\r|\n)\s*){2,}/g, "\n\n");

    //-- collapse runs of spaces into a single space
    returnText = returnText.replace(/ +(?= )/g, "");

    //-- decode common HTML entities (&amp; last, so "&amp;lt;" becomes "&lt;", not "<")
    returnText = returnText.replace(/&nbsp;/gi, " ");
    returnText = returnText.replace(/&quot;/gi, '"');
    returnText = returnText.replace(/&lt;/gi, "<");
    returnText = returnText.replace(/&gt;/gi, ">");
    returnText = returnText.replace(/&amp;/gi, "&");

    //-- write the result to the output field
    document.getElementById("output").value = returnText;
}
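The entity step in the function covers only five named entities, and chained replace calls are sensitive to ordering. As a rough sketch of one alternative, the helper below decodes named and numeric entities in a single pass, so text that has already been decoded is never decoded a second time. The names `decodeEntities` and `NAMED_ENTITIES` are my own and not part of the answer above, and the lookup table is deliberately tiny.

```javascript
// Hypothetical helper, not part of the original answer.
// Small lookup table of named entities; extend it as needed.
var NAMED_ENTITIES = { nbsp: " ", amp: "&", quot: '"', lt: "<", gt: ">", apos: "'" };

function decodeEntities(text) {
    // A single replace pass: the replacement text is never rescanned,
    // so "&amp;lt;" becomes "&lt;" rather than "<".
    return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, function (match, body) {
        if (body.charAt(0) === "#") {
            // Numeric reference: "#39" is decimal, "#x27" is hexadecimal.
            var isHex = body.charAt(1).toLowerCase() === "x";
            var code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
            // Leave out-of-range references as they were written.
            return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
        }
        // Named reference: matched case-insensitively, like the /gi flags above.
        var key = body.toLowerCase();
        return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, key)
            ? NAMED_ENTITIES[key]
            : match; // unknown entity: keep it unchanged
    });
}
```

If this suits your input, the five entity replace calls in convertHtmlToText could probably be swapped for a single `returnText = decodeEntities(returnText);`, though I have only tried it on small strings.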
The convertHtmlToText function was used with this HTML: