HTMLAgilityPack load AJAX content for scraping

Submitted by 混江龙づ霸主 on 2019-12-21 19:20:02

Question


I'm trying to scrape a webpage using HtmlAgilityPack in a C# WebForms project.

All the solutions I've seen for doing this use a WebBrowser control. However, from what I can determine, that control is only available in WinForms projects.

At present I'm loading the required page with this code:

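// inputUri is the address of the page to scrape (defined elsewhere).
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load(inputUri);
// SelectNodes returns null, not an empty collection, when nothing matches.
HtmlAgilityPack.HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[@class=\"nav\"]");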

Here is an example of the code I've seen that relies on the WebBrowser control:

if (this.webBrowser1.Document.GetElementsByTagName("html")[0] != null)
    _htmlAgilityPackDocument.LoadHtml(this.webBrowser1.Document.GetElementsByTagName("html")[0].OuterHtml);

Any suggestions or pointers on how to grab the page once the AJAX content has loaded would be appreciated.


Answer 1:


It seems that HtmlAgilityPack can only scrape content that is present in the HTML returned by the server. It parses markup but does not execute JavaScript, so anything loaded afterwards via AJAX is not visible to it.

Perhaps the easiest option, where feasible, is to use a browser-based tool such as Firebug to find the request that supplies the AJAX-loaded data, and then request and process that source data directly. An added advantage might be the ability to scrape a larger dataset.
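
As a rough illustration of working with the source data directly, the sketch below requests an AJAX endpoint with HttpClient and reads values out of a JSON response. Everything specific in it is an assumption rather than something from the question: the class and method names are made up for the example, the endpoint URL is whatever the browser's network panel reveals, and the { "items": [ { "name": ... } ] } shape is purely hypothetical. System.Text.Json is built into .NET Core 3.0 and later; on .NET Framework it comes from the NuGet package of the same name.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class AjaxSourceScraper
{
    // One shared HttpClient instance for all requests.
    private static readonly HttpClient Client = new HttpClient();

    // Downloads the raw response body of the endpoint the page calls via AJAX.
    // The caller supplies the URL found in the browser's network panel.
    public static Task<string> FetchAsync(string endpointUrl)
    {
        return Client.GetStringAsync(endpointUrl);
    }

    // Collects the "name" string of each object in the top-level "items" array.
    // This payload shape is hypothetical; adjust it to the real response.
    public static List<string> ParseNames(string json)
    {
        var names = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            JsonElement root = doc.RootElement;
            if (root.ValueKind != JsonValueKind.Object
                || !root.TryGetProperty("items", out JsonElement items)
                || items.ValueKind != JsonValueKind.Array)
            {
                return names; // Unexpected shape: return an empty list.
            }

            foreach (JsonElement item in items.EnumerateArray())
            {
                if (item.ValueKind == JsonValueKind.Object
                    && item.TryGetProperty("name", out JsonElement name)
                    && name.ValueKind == JsonValueKind.String)
                {
                    names.Add(name.GetString());
                }
            }
        }
        return names;
    }
}
```

A caller would await AjaxSourceScraper.FetchAsync(url) and pass the result to ParseNames. If the endpoint returns an HTML fragment instead of JSON, the downloaded string can be handed to HtmlAgilityPack's HtmlDocument.LoadHtml and queried with SelectNodes as in the question.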



Source: https://stackoverflow.com/questions/24907125/htmlagilitypack-load-ajax-content-for-scraping
