Issue with HTMLAgilityPack parsing HTML using C#

I'm just trying to learn about HTMLAgilityPack and XPath. I'm attempting to get a list of companies (as HTML links) from the NASDAQ website:

http://www.nasdaq.com/quotes/

3 Answers

你的背包

2021-01-21 06:32
Since the data comes from JavaScript, you have to parse the JavaScript rather than the HTML. The Agility Pack doesn't help that much here, but it makes things a bit easier. The following shows how it could be done using the Agility Pack to locate the script and Newtonsoft Json.NET to parse the array it contains.
```
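using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
using Newtonsoft.Json.Linq;

// Download the page and parse the HTML exactly as the server delivers it.
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load(new WebClient().OpenRead("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx"));
List<string> listStocks = new List<string>();
// Find the inline script that defines the table_body array.
HtmlNode scriptNode = htmlDoc.DocumentNode.SelectSingleNode("//script[contains(text(),'var table_body =')]");
if (scriptNode != null)
{
  // Use a regex to extract just the array literal we're interested in.
  string stockArray = Regex.Match(scriptNode.InnerText, "table_body = (?<Array>\\[.+?\\]);").Groups["Array"].Value;
  JArray jArray = JArray.Parse(stockArray);
  foreach (JToken token in jArray.Children())
  {
    // The first element of each inner array is the ticker symbol.
    listStocks.Add("http://www.nasdaq.com/symbol/" + token.First.Value<string>().ToLower());
  }
}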
```
To explain in a bit more detail: the data comes from one big JavaScript array on the page, var table_body = [.... Each stock is one element of that array and is itself an array:

["ATVI", "Activision Blizzard, Inc", 11.75, 0.06, 0.51, 3058125, 0.06, "N", "N"]

So by parsing the array, taking the first element of each entry, and appending it to the fixed base URL, we get the same result as the JavaScript does.
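
The extraction step can be tried in isolation, without downloading the page. The following is a minimal sketch, not the code from the answer above: it swaps Json.NET for System.Text.Json (part of the .NET base library) so it runs without extra packages, and it adds RegexOptions.Singleline on the assumption that the array may span several lines. The class name StockLinkSketch and the sample string are hypothetical; the sample reuses the one row quoted above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.RegularExpressions;

public static class StockLinkSketch
{
    // Pulls the table_body array out of a script's text and builds one URL per row.
    public static List<string> ExtractLinks(string script)
    {
        var links = new List<string>();

        // Singleline lets '.' match newlines (assumption: the array may be multi-line).
        Match match = Regex.Match(script, @"table_body = (?<Array>\[.+?\]);", RegexOptions.Singleline);
        if (!match.Success)
        {
            return links; // No array found: return an empty list rather than throwing.
        }

        using JsonDocument doc = JsonDocument.Parse(match.Groups["Array"].Value);
        foreach (JsonElement row in doc.RootElement.EnumerateArray())
        {
            // row[0] is the ticker symbol, the first element of each inner array.
            links.Add("http://www.nasdaq.com/symbol/" + row[0].GetString().ToLowerInvariant());
        }
        return links;
    }

    public static void Main()
    {
        // Hypothetical script text containing the single row shown above.
        string sample = "var table_body = [[\"ATVI\", \"Activision Blizzard, Inc\", 11.75, 0.06, 0.51, 3058125, 0.06, \"N\", \"N\"]];";
        foreach (string link in ExtractLinks(sample))
        {
            Console.WriteLine(link); // prints http://www.nasdaq.com/symbol/atvi
        }
    }
}
```

The lazy .+? still captures the whole outer array because each inner array ends in ], or ]] and only the outer one is followed by a semicolon.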

醉梦人生

2021-01-21 06:39
Why not just use the Descendants("a") method? It's much simpler and more object-oriented: you just get a collection of node objects, and then you can read the "href" attribute from each of them.

Sample code:
```
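// Requires: using System.Linq; Descendants returns a sequence of nodes, so read href per node.
var hrefs = htmlDoc.DocumentNode.Descendants("a").Select(a => a.GetAttributeValue("href", string.Empty)).Where(h => h.Length > 0).ToList();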
```
If you just need a list of the links on a given web page, this method will do just fine.

别那么骄傲

2021-01-21 06:41

If you look at the page source for that URL, there's not actually an element with id=indu_table. It appears to be generated dynamically (i.e. by JavaScript); the HTML that you get when loading directly from the server will not reflect anything that's changed by client script. This is probably why it's not working.
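
One way to confirm this is to run the same ID lookup against the markup as the server delivers it. Below is a minimal sketch using the Html Agility Pack (it needs the HtmlAgilityPack package); the class name StaticHtmlCheck and the inline HTML are hypothetical stand-ins for the real page, which would be downloaded instead.

```csharp
using System;
using HtmlAgilityPack;

public static class StaticHtmlCheck
{
    // True when the HTML, without any script having run, already contains the element.
    public static bool HasElementWithId(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectSingleNode returns null when nothing matches the XPath.
        return doc.DocumentNode.SelectSingleNode("//*[@id='" + id + "']") != null;
    }

    public static void Main()
    {
        // Hypothetical served markup: the table is absent, only the script that would build it.
        string served = "<html><body><div id='container'></div>"
                      + "<script>var table_body = [];</script></body></html>";

        Console.WriteLine(HasElementWithId(served, "indu_table")); // prints False
        Console.WriteLine(HasElementWithId(served, "container"));  // prints True
    }
}
```

If the lookup returns false for the downloaded page but the element shows up in the browser's inspector, the element was created by client script.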
