What is the best way to get a plain text string from an HTML string?
public string GetPlainText(string htmlString)
{
// any .NET built in utility?
}
There is no built-in solution in the framework.
If you need to parse HTML, I have had good experiences with a library called HTML Agility Pack.
It parses an HTML document and exposes it as a DOM, similar to the XML classes.
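As a rough sketch of what that could look like for the question's GetPlainText method: the code below assumes the HtmlAgilityPack NuGet package is referenced, and the class name AgilityPackExample is made up for illustration. It loads the string into a document, reads the text of the root node, and then decodes entities.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of the framework

public static class AgilityPackExample // hypothetical class name
{
    public static string GetPlainText(string htmlString)
    {
        // Parse the markup; the parser tolerates malformed HTML.
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlString);

        // InnerText concatenates the text of all descendant nodes.
        string text = doc.DocumentNode.InnerText;

        // Decode entities that may remain in the extracted text.
        return HtmlEntity.DeEntitize(text);
    }
}
```

Note that InnerText also includes the contents of script and style elements, so those nodes may need to be removed first if the input can contain them.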
You can use MSHTML, which can be pretty forgiving:
// add a reference to Microsoft.mshtml, then: using mshtml;
HTMLDocument htmldoc = new HTMLDocument();
IHTMLDocument2 htmldoc2 = (IHTMLDocument2)htmldoc;
htmldoc2.write(new object[] { "<p>Plateau <i>of<i> <b>Leng</b><hr /><b erp=\"arp\">2 sugars please</b> <xxx>what? & who?" });
string txt = htmldoc2.body.outerText;
Plateau of Leng 2 sugars please what? & who?
Personally, I found a combination of regex and HttpUtility to be the best and shortest solution.
Return HttpUtility.HtmlDecode(
Regex.Replace(HtmlString, "<(.|\n)*?>", "")
)
This removes all the tags, and then decodes any remaining entities such as &lt; or &gt;.
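For reference, here is a minimal C# sketch of the same idea. It swaps in System.Net.WebUtility.HtmlDecode for HttpUtility.HtmlDecode so that no reference to System.Web is needed; that substitution, and the class name HtmlText, are assumptions of this sketch rather than part of the answer above.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText // hypothetical class name
{
    public static string GetPlainText(string htmlString)
    {
        if (string.IsNullOrEmpty(htmlString))
        {
            return string.Empty;
        }

        // Lazily match everything from '<' to the next '>', including newlines.
        string withoutTags = Regex.Replace(htmlString, @"<(.|\n)*?>", "");

        // Turn entities such as &lt; or &acirc; back into characters.
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

With this, HtmlText.GetPlainText("<p>I'm HTML!</p>") returns "I'm HTML!". The regex only strips the tags themselves, so text inside script or style elements is kept, and a '>' inside an attribute value ends the match early.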
There's no built-in utility as far as I know, but depending on your requirements you could use regular expressions to strip out all of the tags:
string htmlString = @"<p>I'm HTML!</p>";
// Regex.Replace returns a new string, so keep the result
string plainText = Regex.Replace(htmlString, @"<(.|\n)*?>", "");
There isn't a built-in .NET method to do it. But, as pointed out by @rudi_visser, it can be done with regular expressions.
If you need to do more than just remove the tags (e.g., turn &acirc; into â), you can use a more elaborate solution, like the one found here.