How to extract Article Text contents from HTML page like Pocket (Read It Later) or Readability? [closed]

≯℡__Kan透↙ 提交于 2019-11-28 06:37:48
L.B

I would recommend NReadability, together with HtmlAgilityPack

Main text is always in div with id readInner after NReadability transcoded the page.

//** replace this with any url **
string url = "http://www.bbc.co.uk/news/world-asia-19457334";

var t = new NReadability.NReadabilityWebTranscoder();
bool b;
string page = t.Transcode(url, out b);

if (b)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(page);

    var title = doc.DocumentNode.SelectSingleNode("//title").InnerText;
    var imgUrl = doc.DocumentNode.SelectSingleNode("//meta[@property='og:image']").Attributes["content"].Value;
    var mainText = doc.DocumentNode.SelectSingleNode("//div[@id='readInner']").InnerText;
}

Use the HTML Agilty Pack - it is an open source HTML parser for .NET.

What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

You can use this to query HTML and extract whatever data you wish.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!