Issue with HTMLAgilityPack parsing HTML using C#

夕颜 2021-01-21 05:54

I'm just starting to learn HTMLAgilityPack and XPath, and I'm trying to get a list of companies (as HTML links) from the NASDAQ website:

http://www.nasdaq.com/quotes/

3 Answers
  • 2021-01-21 06:32

    Since the data comes from JavaScript, you have to parse the JavaScript rather than the HTML. The Agility Pack therefore can't do the whole job, but it still makes it easier to locate the right script block. The following shows how it could be done using the Agility Pack to find the script and Newtonsoft Json.NET to parse the array inside it.

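    // Requires: using System.Collections.Generic; using System.IO; using System.Net;
    // using System.Text.RegularExpressions; using HtmlAgilityPack; using Newtonsoft.Json.Linq;
    HtmlDocument htmlDoc = new HtmlDocument();
    using (WebClient client = new WebClient())
    using (Stream stream = client.OpenRead("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx"))
    {
      htmlDoc.Load(stream);
    }
    List<string> listStocks = new List<string>();
    // Find the <script> block that defines the table_body variable
    HtmlNode scriptNode = htmlDoc.DocumentNode.SelectSingleNode("//script[contains(text(),'var table_body =')]");
    if (scriptNode != null)
    {
      //Using Regex here to get just the array we're interested in...
      Match match = Regex.Match(scriptNode.InnerText, "table_body = (?<Array>\\[.+?\\]);");
      if (match.Success)
      {
        JArray jArray = JArray.Parse(match.Groups["Array"].Value);
        foreach (JToken token in jArray.Children())
        {
          //The first element of each row is the ticker symbol
          listStocks.Add("http://www.nasdaq.com/symbol/" + token.First.Value<string>().ToLower());
        }
      }
    }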
    

    To explain in a bit more detail: the data comes from one big JavaScript array on the page, declared as var table_body = [.... Each stock is one element of that array and is itself an array:

    ["ATVI", "Activision Blizzard, Inc", 11.75, 0.06, 0.51, 3058125, 0.06, "N", "N"]

    So by parsing the array, taking the first element of each row (the ticker symbol), and appending it in lower case to the fixed base URL, we get the same links that the page's JavaScript generates.
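
    As a small, self-contained illustration of that last step, here is a sketch that runs against a made-up script string containing only the sample row above, so it needs no network access. It uses System.Text.Json instead of Json.NET purely to avoid a NuGet dependency; that swap is an assumption of this sketch, and since System.Text.Json is stricter than Json.NET, the real page's array may not parse with it. The class name StockLinkSketch and the helper ExtractLinks are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.RegularExpressions;

public static class StockLinkSketch
{
    // Hypothetical helper: pulls the table_body array out of script text and
    // turns the first element of every row into a symbol URL.
    public static List<string> ExtractLinks(string scriptText)
    {
        var links = new List<string>();

        // Singleline lets '.' span newlines in case the array is split across lines.
        Match match = Regex.Match(
            scriptText,
            @"table_body\s*=\s*(?<Array>\[.+?\]);",
            RegexOptions.Singleline);
        if (!match.Success)
        {
            return links; // variable not found: nothing to parse
        }

        using JsonDocument doc = JsonDocument.Parse(match.Groups["Array"].Value);
        foreach (JsonElement row in doc.RootElement.EnumerateArray())
        {
            // Each row is itself an array; element 0 is the ticker symbol.
            string symbol = row[0].GetString() ?? "";
            links.Add("http://www.nasdaq.com/symbol/" + symbol.ToLowerInvariant());
        }
        return links;
    }

    public static void Main()
    {
        // Made-up script text containing only the single row quoted above.
        string script =
            "var table_body = [[\"ATVI\", \"Activision Blizzard, Inc\", 11.75, 0.06, 0.51, 3058125, 0.06, \"N\", \"N\"]];";
        foreach (string link in ExtractLinks(script))
        {
            Console.WriteLine(link); // prints http://www.nasdaq.com/symbol/atvi
        }
    }
}
```

    Returning an empty list when the variable is missing keeps the caller from having to handle a parse exception for a page that simply has a different layout.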

  • 2021-01-21 06:39

    Why not just use the Descendants("a") method? It's much simpler and more object-oriented: you get back a collection of node objects, and then you can read the "href" attribute from each of them.

    Sample code (needs using System.Linq):

    var hrefs = htmlDoc.DocumentNode.Descendants("a")
        .Select(a => a.GetAttributeValue("href", "")).ToList();
    

    If you just need a list of links from a given web page, this method will do just fine.

  • 2021-01-21 06:41

    If you look at the page source for that URL, there isn't actually an element with id=indu_table. It appears to be generated dynamically (i.e. by JavaScript); the HTML you get when loading the page directly from the server does not reflect anything that client-side script changes afterwards. This is probably why it's not working.
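
    One rough way to confirm this is to test the raw markup the server returns for the id before writing any XPath against it. The sketch below does that with a plain regular expression on a made-up HTML string; it is a crude textual check rather than a real parser, and the names StaticHtmlCheck and HasElementId are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StaticHtmlCheck
{
    // Hypothetical helper: true if the raw markup contains an attribute
    // id=<id> (quoted or unquoted). A textual check only, not an HTML parser.
    public static bool HasElementId(string html, string id)
    {
        string pattern = @"\s(?i:id)\s*=\s*[""']?" + Regex.Escape(id) + @"(?=[""'\s>])";
        return Regex.IsMatch(html, pattern);
    }

    public static void Main()
    {
        // Made-up markup: the table only comes into existence when the script runs.
        string served =
            "<html><body><div id=\"content\"></div>" +
            "<script>document.write('<table id=' + 'indu_table' + '>');</script>" +
            "</body></html>";

        Console.WriteLine(HasElementId(served, "indu_table")); // False: not in the served markup
        Console.WriteLine(HasElementId(served, "content"));    // True: a static element
    }
}
```

    If the check comes back false for the text you downloaded, no XPath query against that HTML will find the element, and the data has to be pulled from the script instead.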
