Question
I am trying to scrape data from a news article using HTML Agility Pack. The link is as follows: http://www.ndtv.com/india-news/vyapam-scam-documents-show-chief-minister-shivraj-chouhan-delayed-probe-780528
I have written the code below to extract all the comments in this article, but for some reason my variable aTags is null.
Code:
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load(txtinputurl.Text);
// SelectNodes returns null (not an empty collection) when nothing matches.
var aTags = document.DocumentNode.SelectNodes("//div[@class='com_user_text']");
int counter = 1;
if (aTags != null)
{
    foreach (var aTag in aTags)
    {
        // Append one numbered entry per comment.
        lbloutput.Text += counter + ". " + aTag.InnerHtml + "\t" + "<br />";
        counter++;
    }
}
I have also tried this XPath, with the same result: //div[@class='newcomment_list']/ul/li/div[@class='headerwrap']/div[@class='com_user_text']
Please help me with the correct XPath to extract all the comments. I have searched all over the net but found no solution.
Answer 1:
Do a 'View Source' on the page and search for com_user_text. The user comments don't appear at all. They are loaded via JavaScript after the page is loaded. So when you load the page content via getHtmlWeb.Load(), you don't get the user comments.
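The same 'View Source' check can be done from code. The following is a minimal sketch, not part of the original answer: it downloads the page as the server sends it (no JavaScript is run) and reports whether the class name occurs anywhere in that text. The class RawHtmlCheck and its methods ContainsMarker and FetchRawHtmlAsync are hypothetical names introduced for illustration, and it uses the standard HttpClient rather than HTML Agility Pack so that it has no external dependencies.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
static class RawHtmlCheck
{
    // Returns true if the marker (for example a CSS class name)
    // occurs anywhere in the HTML text, ignoring case.
    public static bool ContainsMarker(string html, string marker)
    {
        return html.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Downloads the response body exactly as the server sends it.
    // No scripts are executed, so script-generated markup is absent.
    public static async Task<string> FetchRawHtmlAsync(string url)
    {
        using (var client = new HttpClient())
        {
            return await client.GetStringAsync(url);
        }
    }

    static void Main(string[] args)
    {
        // URL taken from the question; pass another one as the first argument.
        string url = args.Length > 0
            ? args[0]
            : "http://www.ndtv.com/india-news/vyapam-scam-documents-show-chief-minister-shivraj-chouhan-delayed-probe-780528";

        string html = FetchRawHtmlAsync(url).GetAwaiter().GetResult();

        Console.WriteLine(ContainsMarker(html, "com_user_text")
            ? "Marker found in the raw HTML; an XPath query has something to match."
            : "Marker not found; the content is probably added by script after load.");
    }
}
```

If the marker is missing from the raw HTML, no XPath expression will find it, because the nodes are not in the document that was downloaded.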
As this answer says, HTML Agility Pack is not a tool capable of emulating a browser and running JavaScript. Instead, you need something like WatiN that "allows programmatic access to web pages through a given browser engine and will load the full document."
Source: https://stackoverflow.com/questions/31411942/scraping-using-html-agility-package