HtmlAgilityPack and large HTML Documents

Submitted by 丶灬走出姿态 on 2019-12-14 03:55:37

Question


I have built a little crawler, and while trying it out I found that when crawling certain sites it uses 98-99% CPU.

I used dotTrace to see what the problem could be, and it pointed me towards my HttpWebRequest method. I optimised it a bit with the help of some previous questions here on Stack Overflow, but the problem was still there.

I then checked which URLs were causing the CPU load and found that it was actually the sites that are extremely large in size - go figure :) So now I am 99% certain it has to do with the following piece of code:

HtmlAgilityPack.HtmlDocument documentt = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNodeCollection list;
HtmlAgilityPack.HtmlNodeCollection frameList;

documentt.LoadHtml(_html);
list = documentt.DocumentNode.SelectNodes(".//a[@href]");

All I want to do is extract the links on the page, so for large sites... is there any way I can get this to not use so much CPU?

I was thinking of maybe limiting what I fetch? What would be my best option here?

Certainly someone must have run into this problem before :)


Answer 1:


".//a[@href]" is extremely slow XPath. Tried to replace with "//a[@href]" or with code that simply walks whole document and checks all A nodes.

Why this XPath is slow:

  1. "." starting with a node
  2. "//" select all descendent nodes
  3. "a" - pick only "a" nodes
  4. "@href" with href.

Parts 1 and 2 together mean "for every node, select all of its descendant nodes", which is what makes it so slow on a large document.
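
For illustration, here is a minimal sketch of both alternatives, assuming _html holds the downloaded page as in the question:

using System.Collections.Generic;

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(_html);

// Option 1: anchor the XPath at the document root instead of ".//".
// Note: SelectNodes returns null when nothing matches.
HtmlAgilityPack.HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");

// Option 2: skip XPath entirely and walk the parsed tree once.
List<string> hrefs = new List<string>();
foreach (HtmlAgilityPack.HtmlNode node in doc.DocumentNode.Descendants("a"))
{
    string href = node.GetAttributeValue("href", string.Empty);
    if (href.Length > 0)
        hrefs.Add(href);
}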




Answer 2:


Have you tried dropping the XPath and using the LINQ functionality?

var list = documentt.DocumentNode.Descendants("a")
                    .Select(n => n.GetAttributeValue("href", string.Empty))
                    .ToList();

That'll pull the href attribute of every anchor tag into a List<string>.




Answer 3:


If you aren't heavily invested in Html Agility Pack, try using CsQuery instead. It builds an index when parsing the document, and its selectors are much faster than Html Agility Pack's. See a comparison.

CsQuery is a .NET jQuery port with a full CSS selector engine; it lets you use CSS selectors as well as the jQuery API to access and manipulate HTML. It's on NuGet as CsQuery.
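
As a rough sketch of what the same link extraction looks like with CsQuery (the specific calls below, CQ.Create and GetAttribute, are my recollection of the CsQuery API and worth checking against its documentation), again assuming _html holds the page:

using System.Linq;
using CsQuery;

CQ dom = CQ.Create(_html);

// CSS selector: every <a> element that has an href attribute.
var hrefs = dom["a[href]"]
    .Select(el => el.GetAttribute("href"))
    .ToList();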



Source: https://stackoverflow.com/questions/12804281/htmlagilitypack-and-large-html-documents
