html-agility-pack

Get a specific option in HtmlAgilityPack?

浪子不回头ぞ posted on 2020-01-30 11:34:19
Question: Is it possible to get a specific option with HtmlAgilityPack? For example, I have a select like this: <select id="foo"> <option value="0">1</option> <option value="1" selected="selected">2</option> </select> I need to get the option marked as selected. I know how to get all the options with: doc.DocumentNode.SelectNodes("//select[@id='foo']//option");
Answer 1: This should work: doc.DocumentNode.SelectNodes("//select[@id='foo']/option[@selected='selected']"); You can read more about XPath here
Answer 2: doc
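A minimal sketch of the approach in the first answer, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (SelectedOptionSketch, GetSelectedValue) are hypothetical and chosen for illustration. The predicate here is [@selected] rather than [@selected='selected'], on the assumption that the attribute may also be written without a value.

```csharp
using HtmlAgilityPack;

public static class SelectedOptionSketch
{
    // Returns the value attribute of the selected <option> inside the
    // <select> with the given id, or null when no option is marked selected.
    public static string GetSelectedValue(string html, string selectId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // [@selected] only tests that the attribute is present.
        HtmlNode option = doc.DocumentNode.SelectSingleNode(
            "//select[@id='" + selectId + "']/option[@selected]");

        // SelectSingleNode returns null when nothing matches.
        return option == null ? null : option.GetAttributeValue("value", "");
    }
}
```

The sketch reads the value attribute rather than the option's text because, as far as I know, some older HtmlAgilityPack versions treat option as an empty element, which can leave its text outside the node.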

How to get html elements with multiple css classes

邮差的信 posted on 2020-01-26 12:46:45
Question: I know how to get a list of divs with the same CSS class, e.g. <div class="class1">1</div> <div class="class1">2</div>, using the XPath //div[@class='class1']. But what if a div has multiple classes, e.g. <div class="class1 class2">1</div>? What would the XPath look like then?
Answer 1: The expression you're looking for is: //div[contains(@class, 'class1') and contains(@class, 'class2')] I highly suggest XPath Visualizer, which can help you debug XPath expressions easily. It can be found here: http:/
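One caveat with the contains(@class, ...) expression above is that it is a substring test, so 'class1' would also match an element whose class is 'class10'. The sketch below shows a common XPath 1.0 workaround that pads the attribute with spaces and matches whole tokens. It assumes the HtmlAgilityPack NuGet package is referenced; the names MultiClassSketch, HasClass and DivTextsWithBothClasses are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class MultiClassSketch
{
    // Builds an XPath predicate that matches className only as a whole
    // space-separated token of the class attribute.
    private static string HasClass(string className)
    {
        return "contains(concat(' ', normalize-space(@class), ' '), ' " + className + " ')";
    }

    // Returns the inner text of every div carrying both classes.
    public static List<string> DivTextsWithBothClasses(string html, string first, string second)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        string xpath = "//div[" + HasClass(first) + " and " + HasClass(second) + "]";
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (nodes == null) return new List<string>();
        return nodes.Select(n => n.InnerText).ToList();
    }
}
```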

Trouble Scraping Web Page With Malformed Content

若如初见. posted on 2020-01-25 23:13:51
Question: I have written C# code that uses the HtmlAgilityPack library to scrape a page located at: World's Largest Urban Areas (Page 2). Unfortunately, the page contains malformed content, and I'm at an impasse on how to scrape it. The current code I have (appearing below) freezes while parsing the HTML: HtmlNodeCollection cityRecords = _htmlDocument.DocumentNode.SelectNodes("//table[@class='boldtable']//tr[position() != 1]"); CityNodes = (from node in cityRecords.Descendants() where

HtmlAgilityPack Select individual elements from a list of divs

南楼画角 posted on 2020-01-25 11:52:29
Question: I am trying to use the HtmlAgilityPack to scrape child elements from a list of divs. The outermost parent div is //div[@class='cell in-area-cell middle-cell'], and if I simply iterate through the list I can display all the child content from each parent fine. But I don't want to display all the content; I would like to pick certain divs, p's and a's from each of the children, but the code below only gives me a list of the first //a[@class='listing-name']. It gives me the correct number of
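The asker's code is cut off, so the cause cannot be confirmed, but a common reason for this symptom is an XPath that starts with // inside the loop. In HtmlAgilityPack, // searches from the document root even when called on a child node, so every iteration returns the first match in the whole document. The sketch below uses .// to stay inside the current node. It assumes the HtmlAgilityPack NuGet package is referenced; RelativeXPathSketch and ListingNames are hypothetical names.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class RelativeXPathSketch
{
    // Collects the text of the listing-name link inside each parent cell.
    public static List<string> ListingNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var names = new List<string>();
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes(
            "//div[@class='cell in-area-cell middle-cell']");
        if (cells == null) return names; // nothing matched

        foreach (HtmlNode cell in cells)
        {
            // ".//" is evaluated relative to this cell; a leading "//"
            // would restart the search at the document root.
            HtmlNode link = cell.SelectSingleNode(".//a[@class='listing-name']");
            if (link != null) names.Add(link.InnerText.Trim());
        }
        return names;
    }
}
```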

html agility pack question in parsing

寵の児 posted on 2020-01-24 01:14:11
Question: I have this simple string: string testString = "6/21 <span style='font-size: x-small; font-family: Arial'><span style='font-size: 10pt; font-family: Arial'>Just got 78th street</span></span>"; How do I use the Html Agility Pack to parse out just the text? Please note: there is a span nested inside another span. Thanks, Rod.
Answer 1: I think the InnerText property should give just the text - var testString = "6/21 <span style='font-size: x-small; font-family: Arial'><span style='font-size: 10pt;
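The answer's code is cut off, so the following is only a sketch of the InnerText idea it names, not the answerer's exact code. It assumes the HtmlAgilityPack NuGet package is referenced; PlainTextSketch and ExtractText are hypothetical names, and the HtmlEntity.DeEntitize step is an addition for inputs that contain entities such as &amp;.

```csharp
using HtmlAgilityPack;

public static class PlainTextSketch
{
    // Returns the text content of an HTML fragment with all tags dropped.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates every text node under the document node,
        // so nesting depth of the spans does not matter.
        string text = doc.DocumentNode.InnerText;

        // Decode HTML entities, if any, into plain characters.
        return HtmlEntity.DeEntitize(text);
    }
}
```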

HTML agility pack - removing unwanted tags without removing content?

筅森魡賤 posted on 2020-01-18 07:14:44
Question: I've seen a few related questions out here, but they don't exactly address the problem I am facing. I want to use the HTML Agility Pack to remove unwanted tags from my HTML without losing the content within the tags. For instance, in my scenario, I would like to preserve the tags "b", "i" and "u". For an input like: <p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p> The resulting HTML should be: my paragraph and my <b>div</b> are <i>italic<
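No answer survives in this excerpt, so the following is just one possible sketch. For each element that is not on the allow list, it moves the element's children up in front of it and then removes the now-redundant element. It assumes the HtmlAgilityPack NuGet package is referenced; TagStripSketch and KeepOnly are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TagStripSketch
{
    // Removes every element whose tag is not in allowedTags while keeping
    // its children (text and allowed elements) in place.
    public static string KeepOnly(string html, params string[] allowedTags)
    {
        var allowed = new HashSet<string>(allowedTags, StringComparer.OrdinalIgnoreCase);
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Take a snapshot first, because the tree is modified in the loop.
        List<HtmlNode> unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element && !allowed.Contains(n.Name))
            .ToList();

        foreach (HtmlNode node in unwanted)
        {
            HtmlNode parent = node.ParentNode;

            // Hoist each child so it sits just before the unwanted element.
            foreach (HtmlNode child in node.ChildNodes.ToList())
            {
                parent.InsertBefore(child, node);
            }

            // The element itself can now go without losing its content.
            parent.RemoveChild(node);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

For the question's scenario the call would be TagStripSketch.KeepOnly(input, "b", "i", "u"). How the parser nests the malformed p/div pair may vary between library versions, so the exact whitespace of the result is worth checking against real input.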

Scraping using Html Agility Pack

怎甘沉沦 posted on 2020-01-17 01:42:05
Question: I am trying to scrape data from a news article using HtmlAgilityPack. The link is as follows: http://www.ndtv.com/india-news/vyapam-scam-documents-show-chief-minister-shivraj-chouhan-delayed-probe-780528 I have written the code below to extract all the comments in this article, but for some reason my variable aTags is returning null. Code: var getHtmlWeb = new HtmlWeb(); var document = getHtmlWeb.Load(txtinputurl.Text); var aTags = document.DocumentNode.SelectNodes("//div[

Xpath table changes as combobox changes too

我与影子孤独终老i posted on 2020-01-16 09:09:10
Question: I'm working on an application in C# that goes to a website and gets some content out of a table. It's working fine, but here is the problem: the table whose content I'm getting changes when I select a different value in a combobox. The XPath that I use always gets the table that is shown first on the website, and I don't know how to get the other ones. I'm posting everything I think is useful for you to help me. The webpage is: http://br.soccerway.com/national/brazil/serie-a/2012

Download all PDF files from crawled links

倾然丶 夕夏残阳落幕 posted on 2020-01-16 08:27:33
Question: When I run the code, it reports that ProductListPage is null and, after throwing an error, does not proceed any further. Any ideas how to solve this issue? Should I wait until //div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a is found, or something else? Here is my current code: HtmlDocument htmlDoc = new HtmlWeb().Load("https://example.com/"); HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//div[@class='productContain padb6']//div[@class='large-4 medium-4
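SelectNodes returns null rather than an empty collection when the XPath matches nothing, so a null check before iterating avoids the error described above. Waiting is unlikely to help, because HtmlWeb.Load downloads the HTML once and does not execute JavaScript; if the links are injected by script, they will not be in the parsed document. The sketch below assumes the HtmlAgilityPack NuGet package is referenced. NullSafeSelectSketch and CollectLinks are hypothetical names, and the XPath is passed in by the caller because the asker's expression is truncated.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class NullSafeSelectSketch
{
    // Returns the href of every node matched by xpath, or an empty list
    // when the expression matches nothing.
    public static List<string> CollectLinks(HtmlDocument doc, string xpath)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);

        // Guard against the null that signals "no match".
        if (nodes == null) return new List<string>();

        return nodes
            .Select(a => a.GetAttributeValue("href", ""))
            .Where(href => href.Length > 0)
            .ToList();
    }
}
```

An empty result then becomes an ordinary case the caller can log and skip, instead of a NullReferenceException that stops the crawl.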