html-agility-pack

HtmlAgilityPack and large HTML Documents

丶灬走出姿态 posted on 2019-12-14 03:55:37
Question: I built a small crawler, and when trying it out I found that it uses 98-99% CPU when crawling certain sites. I used dotTrace to see what the problem could be, and it pointed me to my HttpWebRequest method. I optimised it a bit with the help of some previous questions here on Stack Overflow, but the problem was still there. I then checked which URLs were causing the CPU load and found that it was actually sites that are extremely large in size - go figure :) So,

Using HTML agility pack on WP7.5

一世执手 posted on 2019-12-14 03:48:36
Question: Is there a reference or guide for using HTML Agility Pack on WP7.5? I tried compiling the source in VS2010, but I wasn't able to reference the DLL created on my local machine. Basically, I'm looking for a text extractor to obtain the text from a given URL, and I understand that the HTML Agility Pack works best. Any ideas or suggestions? Thanks :)
Answer 1: Yes, I was the one who created that discussion. As said on that discussion page, the solution to this problem is to reference the System.Xml.XPath DLL

WebRequest not returning HTML

一世执手 posted on 2019-12-14 03:37:36
Question: I want to load this URL: http://www.yellowpages.ae/categories-by-alphabet/h.html but it returns null. In some questions I have read about adding a cookie container, but that is already in my code:
    var MainUrl = "http://www.yellowpages.ae/categories-by-alphabet/h.html";
    HtmlWeb web = new HtmlWeb();
    web.PreRequest += request =>
    {
        request.CookieContainer = new System.Net.CookieContainer();
        return true;
    };
    web.CacheOnly = false;
    var doc = web.Load(MainUrl);
The website opens perfectly fine in

HTML Agility for WP7 - Silverlight c#

℡╲_俬逩灬. posted on 2019-12-14 03:18:16
Question: I am currently trying to parse specific tables from a DIV in an HTML document. I had this working in Windows Silverlight, but the WP7 HTML Agility Pack seems to be a different thing altogether. The HTML looks like this:
    <div id="FlightInfo_FlightInfoUpdatePanel">
    <table cellspacing="0" cellpadding="0"><tbody>
    <tr class="">
    <td class="airline"><img src="/images/airline logos/NZ.gif" title="AIR NEW ZEALAND LIMITED. " alt="AIR NEW ZEALAND LIMITED. " /></td>
    <td class="flight">NZ8</td>
    <td class="codeshare"> </td>

c# htmlagilitypack xpath select all except with certain class

我的未来我决定 posted on 2019-12-14 03:12:31
Question: I am trying to select all li tags on a page that do not have class="r". What I have so far is:
    .//li
This is what I've tried so far:
    //li[not([@class='r'])]
With that I get the error: "Expression must evaluate to a node-set."
Answer 1: Use this expression:
    //li[not(@class='r')]
    var lis = htmlDoc.DocumentNode.SelectNodes("//li[not(@class='r')]");
Source: https://stackoverflow.com/questions/16649928/c-sharp-htmlagilitypack-xpath-select-all-except-with-certain-class
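
The corrected expression can be tried on a small document. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the sample markup and the Program class are made up for illustration and are not from the original question.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical sample markup: one li with class "r", two without.
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(
            "<ul>" +
            "<li class='r'>skip me</li>" +
            "<li>keep 1</li>" +
            "<li class='x'>keep 2</li>" +
            "</ul>");

        // not(...) takes a boolean expression, so the attribute test goes
        // directly inside it, without an extra pair of square brackets.
        var lis = htmlDoc.DocumentNode.SelectNodes("//li[not(@class='r')]");

        // SelectNodes returns null when nothing matches, so guard before looping.
        if (lis != null)
        {
            foreach (var li in lis)
            {
                Console.WriteLine(li.InnerText);
            }
        }
    }
}
```

Note that @class='r' compares the whole attribute value, so an element with class="r big" would still be selected by this expression.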

How to get next 2 nodes in HTML + HTMLAgilitypack

可紊 posted on 2019-12-14 02:43:28
Question: I have a table in the HTML code below:
    <table style="padding: 0px; border-collapse: collapse;">
    <tr> <td><h3>My Regional Financial Office</h3></td> </tr>
    <tr> <td> </td> </tr>
    <tr> <td><h3>My Address</h3></td> </tr>
    <tr> <td>000 Test Ave S Ste 000</td> </tr>
    <tr> <td>Golden Valley, MN 00000</td> </tr>
    <tr> <td><a href="javascript:submitForm('0000','0000000');">Get Directions</a></td> </tr>
    <tr> <td> </td> </tr>
    </table>
How can I get the inner text of the next 2 <tr> tags after the tablerow
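
One possible approach is the XPath following-sibling axis. The sketch below assumes the row of interest is the one whose h3 reads "My Address"; the question text is cut off before it names the row, so that choice, the shortened sample table, and the Program class are all assumptions made for illustration.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Shortened, hypothetical version of the table from the question.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table>" +
            "<tr><td><h3>My Address</h3></td></tr>" +
            "<tr><td>000 Test Ave S Ste 000</td></tr>" +
            "<tr><td>Golden Valley, MN 00000</td></tr>" +
            "<tr><td>Get Directions</td></tr>" +
            "</table>");

        // Find the row containing the heading (assumed to be "My Address"),
        // then take the first two tr elements that follow it as siblings.
        var rows = doc.DocumentNode.SelectNodes(
            "//tr[td/h3[text()='My Address']]/following-sibling::tr[position() <= 2]");

        // SelectNodes returns null when nothing matches.
        if (rows != null)
        {
            foreach (var row in rows)
            {
                Console.WriteLine(row.InnerText.Trim());
            }
        }
    }
}
```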

Xpath and wildcards

一曲冷凌霜 posted on 2019-12-14 02:05:26
Question: I have tried several combinations without success. The full XPath to that data is .//*[@id='detail_row_seek_37878']/td The problem is that the number portion '37878' changes for each node, so I can't use a foreach to loop through the nodes. Is there some way to use a wildcard and reduce the XPath to .//*[@id='detail wildcard , in an effort to bypass the absolute value portion? I am using HTML Agility Pack for this.
    HtmlNode ddate = node.SelectSingleNode(".//*[@id='detail_row_seek_37878']/td");
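
XPath 1.0 has no wildcard inside a string literal, but its starts-with() function can match on the fixed part of the id. This is a minimal sketch under that assumption; the sample table, the second id value, and the Program class are invented for illustration.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical markup: two rows whose ids share a prefix, one that does not.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table>" +
            "<tr id='detail_row_seek_37878'><td>first</td></tr>" +
            "<tr id='detail_row_seek_40001'><td>second</td></tr>" +
            "<tr id='other_row'><td>ignored</td></tr>" +
            "</table>");

        // starts-with(@id, prefix) matches every element whose id begins
        // with the fixed part, whatever number follows it.
        var cells = doc.DocumentNode.SelectNodes(
            "//*[starts-with(@id, 'detail_row_seek_')]/td");

        // SelectNodes returns null when nothing matches.
        if (cells != null)
        {
            foreach (var cell in cells)
            {
                Console.WriteLine(cell.InnerText);
            }
        }
    }
}
```

If the fixed part can appear anywhere in the id, contains(@id, '...') is the corresponding XPath 1.0 function.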

How to deal with accent problems using HTMLAgilityPack

给你一囗甜甜゛ posted on 2019-12-13 21:13:21
Question: I'm trying to extract the text of an HTML file, but inside a tag the following text appears: <h3>Café<h3> and when I extract the text using the following code:
    htmlDocument.DocumentNode.SelectSingleNode("some XPath").InnerText;
I get this string: "Cafédirect". How can I fix this?
Answer 1: I've answered this here; basically, you can ask HtmlAgilityPack to detect the encoding of the HTML document. HTMLAgilityPack Asp.net C# Error Handling
Answer 2: I know the answer now. While working on it I found the way to do it; here
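
Accent problems of this kind usually come from either a mismatched file encoding or undecoded HTML entities. The sketch below illustrates both cases under stated assumptions: it writes a hypothetical UTF-8 file to a temporary path and loads it with an explicit encoding, then decodes an entity with HtmlEntity.DeEntitize. The file contents and the Program class are made up; the linked answer itself is not reproduced here.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical input: a small UTF-8 file containing an accented word.
        string path = Path.GetTempFileName();
        File.WriteAllText(path,
            "<html><body><h3>Café</h3></body></html>",
            new UTF8Encoding(false));

        // Case 1: tell HtmlAgilityPack which encoding the file uses
        // instead of relying on the default.
        var doc = new HtmlDocument();
        doc.Load(path, Encoding.UTF8);
        string heading = doc.DocumentNode.SelectSingleNode("//h3").InnerText;
        Console.WriteLine(heading);

        // Case 2: if the source contains entities such as &eacute;,
        // InnerText keeps them as written; DeEntitize decodes them.
        Console.WriteLine(HtmlEntity.DeEntitize("Caf&eacute;"));

        File.Delete(path);
    }
}
```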

Windows Phone 8.1 HubApp + HtmlAgilityPack

邮差的信 posted on 2019-12-13 21:12:37
Question: I know that using HAP in Windows Phone apps is very problematic, but I really need to. The problem is that when I add System.Xml.XPath from Silverlight 5 or 4, I get "Xaml Internal Error error WMC9999". Note that the version of HAP is 1.4.6 and not 1.4.9 (the latest one), because 1.4.9 cannot be installed from NuGet (it just doesn't add the reference) and I've found no links to download it manually. In the old Windows Phone 8 Silverlight app everything worked great. Please help.
Answer 1: Use

InnerText=InnerHtml - How to extract readable text with HtmlAgilityPack

蓝咒 posted on 2019-12-13 19:50:30
Question: I need to extract text from very bad HTML. I'm trying to do this using VB.NET and HtmlAgilityPack. The tag that I need to parse has InnerText = InnerHtml, and both are:
    Name:<!--b>=</b--> Albert E<!--span-->instein s<!--i>Y</i-->ection: 3 room: -
While debugging I can read it using the "Html viewer", which shows:
    Name: Albert Einstein section: 3 room: -
How can I get this into a string variable? EDIT: I use this code to get the node:
    Dim ElePs As HtmlNodeCollection = _
        mWPage.DocumentNode.SelectNodes("/
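
One way to approach this is to remove the comment nodes before reading InnerText, since the unwanted fragments all sit inside HTML comments. The sketch below is written in C# rather than the VB.NET of the question, and it assumes the text lives in a p element; the wrapper element and the Program class are hypothetical, and this is not necessarily the answer given in the original thread.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // The fragment from the question, wrapped in a hypothetical <p>.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<p>Name:<!--b>=</b--> Albert E<!--span-->instein " +
            "s<!--i>Y</i-->ection: 3 room: -</p>");

        // Select every comment node; SelectNodes returns null if there are none.
        var comments = doc.DocumentNode.SelectNodes("//comment()");
        if (comments != null)
        {
            // Copy to a list first so removal does not disturb the iteration.
            foreach (var comment in comments.ToList())
            {
                comment.Remove();
            }
        }

        // With the comments gone, InnerText contains only the visible text.
        string text = doc.DocumentNode.SelectSingleNode("//p").InnerText;
        Console.WriteLine(text);
    }
}
```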