Scraping using Html Agility Package

怎甘沉沦 提交于 2020-01-17 01:42:05

问题


I am trying to scrape data from a news article using HtmlAgilityPackage the link is as follows http://www.ndtv.com/india-news/vyapam-scam-documents-show-chief-minister-shivraj-chouhan-delayed-probe-780528

I have written the following code below to extract all the comments in this articles but for some reason my variable aTags is returning null value

Code:

var getHtmlWeb = new HtmlWeb();
        var document = getHtmlWeb.Load(txtinputurl.Text);
        var aTags =    document.DocumentNode.SelectNodes("//div[@class='com_user_text']");
        int counter = 1;
        if (aTags != null)
        {
            foreach (var aTag in aTags)
            {
                lbloutput.Text += lbloutput.Text + ". " + aTag.InnerHtml + "\t" + "<br />";
                counter++;
            }
        }

I have also used this XPath but still the same result //div[@class='newcomment_list']/ul/li/div[@class='headerwrap']/div[@class='com_user_text'] Please help me with the correct Xpath to Extract all the comments Searched all over the net but no solution.


回答1:


Do a 'View Source' on the page and search for com_user_text. The user comments don't appear at all. They are loaded via javascript after the page is loaded. So when you load the page content via getHtmlWeb.Load(), you don't get user comments.

As this answer says, HTML Agility is not a tool capable of emulating a browser and running javascript. Instead, you need something like WatiN that "allows programmatic access to web pages through a given browser engine and will load the full document."



来源:https://stackoverflow.com/questions/31411942/scraping-using-html-agility-package

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!