Why would Html.AgilityPack miss some image tags?

Posted by 泄露秘密 on 2019-12-24 08:31:38

Question


I am using the Html Agility Pack and did something like this:

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://test.com");

int count = doc.DocumentNode.SelectNodes("//img").Count();

I get 38 back.

When I go to that page and do $('img').size(); I get 43 back. Why is there a difference? Is "//img" just looking for root ones?

Is that why I might be missing some?


Answer 1:


Is "//img" just looking for root ones?

No. The leading // selects matching nodes at any depth (children, grandchildren, and so on), starting from the document root. Your XPath expression selects every img element in the document.
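To check this behaviour without touching the network, here is a minimal sketch that parses an in-memory string. The markup is made up for illustration (it is not from the page in the question), and the null check is there because SelectNodes returns null rather than an empty collection when nothing matches.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical markup: one image directly under <body>, two nested deeper.
const string html = @"<html><body>
  <img src='a.png'>
  <div><p><img src='b.png'></p><span><img src='c.png'></span></div>
</body></html>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// "//img" matches img elements at every depth, not only top-level ones.
// SelectNodes returns null when there is no match, so guard against that.
var images = doc.DocumentNode.SelectNodes("//img");
int count = images?.Count ?? 0;

Console.WriteLine($"img nodes: {count}");
```

If the nested images are counted here, the XPath expression itself is not what is dropping nodes on the real page.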

When I go to that page and do $('img').size(); I get 43 back.

My assumption: some of the images are created dynamically via JavaScript. HtmlAgilityPack only parses the HTML the server returns and does not execute scripts, so it never sees those images.

By the way, for http://test.com I got 87 image nodes with AgilityPack (doc.DocumentNode.SelectNodes("//img").Count()), and 87 image nodes from the Chrome console ($('img').size()).

EDIT: The HtmlWeb.Load() method internally uses the WebRequest class to fetch the data; the role of AgilityPack is to parse that data correctly. It is entirely possible that a web server returns different content for the same URI depending on request headers such as User-Agent. The User-Agent header, for example, can be set via the HtmlWeb.UserAgent property.
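A rough sketch of that idea follows, assuming the server varies its markup by User-Agent. The User-Agent string below is only an illustrative browser-like value, not something taken from the question, and the try/catch is there only so the sample does not crash when the request fails.

```csharp
using System;
using HtmlAgilityPack;

// Illustrative browser-like User-Agent value (an assumption for this sketch).
var web = new HtmlWeb
{
    UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
};

try
{
    // Same URL as in the question, now requested with the header above.
    HtmlDocument doc = web.Load("http://test.com");

    // SelectNodes returns null when no node matches.
    var images = doc.DocumentNode.SelectNodes("//img");
    Console.WriteLine($"img nodes: {images?.Count ?? 0}");
}
catch (Exception ex)
{
    Console.WriteLine($"Request failed: {ex.Message}");
}
```

If the count still differs from what the browser console reports, the remaining images are most likely inserted by scripts after the page loads.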



Source: https://stackoverflow.com/questions/9729026/why-would-html-agilitypack-miss-some-image-tags
