Get plain text from HTML in .NET

清酒与你 2020-12-30 21:34

What is the best way to get a plain text string from an HTML string?

public string GetPlainText(string htmlString)
{
    // any .NET built in utility?
}


        
5 answers
  • 2020-12-30 21:53

    There is no built-in solution in the framework.

    If you need to parse HTML, I have had good experiences with a library called HTML Agility Pack.
    It parses an HTML document and exposes it through a DOM, similar to the XML classes.
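
    As a rough illustration of that approach, the sketch below assumes the HtmlAgilityPack NuGet package is referenced. The class name HtmlText is hypothetical; the method mirrors the GetPlainText signature from the question.

    ```csharp
    // Assumption: the HtmlAgilityPack NuGet package is installed.
    using System;
    using HtmlAgilityPack;

    public static class HtmlText
    {
        // Hypothetical helper; not part of HTML Agility Pack itself.
        public static string GetPlainText(string htmlString)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(htmlString); // the parser tolerates malformed markup

            // InnerText concatenates the text nodes of the document;
            // DeEntitize then decodes entities such as &amp;.
            return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
        }
    }
    ```

    As far as I know, InnerText also includes the contents of script and style elements, so those nodes may need to be removed first if the input can contain them.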

  • 2020-12-30 21:54

    You can use MSHTML, which can be pretty forgiving:

    // requires a reference to Microsoft.mshtml (using mshtml;)
    HTMLDocument htmldoc = new HTMLDocument();
    IHTMLDocument2 htmldoc2 = (IHTMLDocument2)htmldoc;
    htmldoc2.write(new object[] { "<p>Plateau <i>of<i> <b>Leng</b><hr /><b erp=\"arp\">2 sugars please</b> <xxx>what? &amp; who?" });
    
    string txt = htmldoc2.body.outerText;
    

    Plateau of Leng 2 sugars please what? & who?

  • 2020-12-30 21:54

    Personally, I found a combination of regex and HttpUtility to be the best and shortest solution.

    Return HttpUtility.HtmlDecode(
                    Regex.Replace(HtmlString, "<(.|\n)*?>", "")
                    )
    

    This removes all the tags, then decodes any remaining entities such as &lt; or &gt;.
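
    For reference, here is the same idea as a complete C# method. This is only a sketch: the class name PlainText is made up, and it swaps in System.Net.WebUtility.HtmlDecode for HttpUtility.HtmlDecode so that no reference to System.Web is needed.

    ```csharp
    using System;
    using System.Net;
    using System.Text.RegularExpressions;

    public static class PlainText
    {
        public static string GetPlainText(string htmlString)
        {
            // Strip anything that looks like a tag, including tags that span lines.
            string withoutTags = Regex.Replace(htmlString, @"<(.|\n)*?>", "");

            // Decode entities such as &lt;, &gt; and &amp; once the tags are gone.
            return WebUtility.HtmlDecode(withoutTags);
        }
    }
    ```

    Calling PlainText.GetPlainText("<p>2 &lt; 3 &amp; <b>true</b></p>") returns "2 < 3 & true". The pattern is not a real parser: a literal > inside an attribute value, or the contents of script and style elements, will not be handled correctly.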

  • 2020-12-30 22:00

    There's no built-in utility as far as I know, but depending on your requirements you could use regular expressions to strip out all of the tags:

    string htmlString = @"<p>I'm HTML!</p>";
    // Regex.Replace returns a new string, so keep the result
    string plainText = Regex.Replace(htmlString, @"<(.|\n)*?>", "");
    
  • 2020-12-30 22:05

    There isn't a built-in .NET method for this. But, as @rudi_visser pointed out, it can be done with regular expressions.

    If you need to do more than just remove the tags (e.g., turn &acirc; into â), you can use a more elaborate solution, like the one found here.
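
    For the entity part specifically, one option (an assumption on my side, not the linked solution) is System.Net.WebUtility.HtmlDecode, which handles both named and numeric entities. The EntityDemo wrapper below is a hypothetical name used only for illustration.

    ```csharp
    using System;
    using System.Net;

    public static class EntityDemo
    {
        // Thin wrapper around the framework call, so it is easy to try out.
        public static string Decode(string text)
        {
            return WebUtility.HtmlDecode(text);
        }
    }
    ```

    With this, EntityDemo.Decode("&acirc;") and EntityDemo.Decode("&#226;") both return "â".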
