I am writing an application that crawls a group of my web pages. Rather than take the entire source code of the page, I'd like to take all of the content and store that and
It isn't 100% clear what you want, but I'm assuming you want the text minus markup; so:
using System.Net;
using System.Text;
using HtmlAgilityPack;

string html;
// obtain some arbitrary html....
using (var client = new WebClient()) {
    html = client.DownloadString("http://stackoverflow.com/questions/2038104");
}
// use the html agility pack: http://www.codeplex.com/htmlagilitypack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
StringBuilder sb = new StringBuilder();
// SelectNodes returns null when nothing matches, so guard against that
var textNodes = doc.DocumentNode.SelectNodes("//text()");
if (textNodes != null) {
    foreach (HtmlTextNode node in textNodes) {
        sb.AppendLine(node.Text);
    }
}
string final = sb.ToString();
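Please, please do not parse HTML yourself! A standard regex cannot parse HTML reliably: HTML allows arbitrarily nested elements and quoted attribute values, which a plain regular expression cannot match in general.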
There are tons of free libraries out there. One of the best free ones in the world of .NET is the HTML Agility Pack.
HTML Agility Pack also handles malformed documents, which a regex or a strict parser such as an XML parser will almost never do.
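To make the regex point concrete, here is a small sketch that uses only System.Text.RegularExpressions. The sample markup and the StripTags helper are made up for illustration; the pattern is the naive non-greedy tag pattern people often reach for first.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPitfallDemo
{
    // Hypothetical helper: naive tag stripping with a non-greedy pattern.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<.*?>", string.Empty);
    }

    public static void Main()
    {
        // A '>' inside a quoted attribute value is legal HTML, but the
        // pattern treats it as the end of the tag.
        string html = "<a title=\"a > b\">link</a>";

        // The first match stops at the '>' inside the attribute, so part
        // of the attribute leaks into the output instead of just "link".
        Console.WriteLine("[" + StripTags(html) + "]");
    }
}
```

The brackets are only there to make the stray leading space visible: the result is ` b">link` rather than `link`. A real parser tracks quoting and nesting, so it does not have this problem.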
I wrote code to strip out the raw text from markup and present it in my article Convert HTML to Text. The code presented is pretty simple and lightweight.
I also wrote a lightweight HTML parser and have posted it on GitHub as HTML Monkey. It is a more complete solution, and extracting only the text from the parsed markup would be a simple task. I'm still working on this project and am looking for feedback on how it works.
The function below removes all HTML tags, scripts, and CSS styles from an HTML string and converts it to plain text.
// requires: using System.Text.RegularExpressions;
private string GetPlainTextFromHtml(string htmlString)
{
    string htmlTagPattern = "<.*?>";
    // remove <script> and <style> blocks together with their contents
    var regexCss = new Regex("(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>)", RegexOptions.Singleline | RegexOptions.IgnoreCase);
    htmlString = regexCss.Replace(htmlString, string.Empty);
    // remove the remaining tags
    htmlString = Regex.Replace(htmlString, htmlTagPattern, string.Empty);
    // drop lines that contain only whitespace
    htmlString = Regex.Replace(htmlString, @"^\s+$[\r\n]*", "", RegexOptions.Multiline);
    // replace non-breaking-space entities with ordinary spaces
    htmlString = htmlString.Replace("&nbsp;", " ");
    return htmlString;
}
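Stripping tags still leaves character references such as &amp;amp; or &amp;lt; in the text. As a hedged sketch, these could be decoded with System.Net.WebUtility.HtmlDecode once the tags are gone. The class and method names here (EntityDemo, DecodeEntities) are hypothetical and not part of the answer above.

```csharp
using System;
using System.Net;

public static class EntityDemo
{
    // Hypothetical helper: decode character references left after tag removal.
    public static string DecodeEntities(string text)
    {
        // WebUtility.HtmlDecode turns &amp; into &, &lt; into <, and so on.
        // &nbsp; decodes to U+00A0, so map that to a regular space afterwards.
        return WebUtility.HtmlDecode(text).Replace('\u00A0', ' ');
    }

    public static void Main()
    {
        Console.WriteLine(DecodeEntities("Fish &amp; chips&nbsp;&lt;fresh&gt;"));
        // prints: Fish & chips <fresh>
    }
}
```

Decoding should happen after the tags are removed; otherwise a decoded &amp;lt; would look like the start of a tag to the tag-stripping step.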