Parse inner HTML

Submitted by 大城市里の小女人 on 2019-12-12 21:46:27

Question


This is what I want to parse:

<div class="photoBox pB-ms">
<a href="/user_details?userid=ePDZ9HuMGWR7vs3kLfj3Gg">
<img width="100" height="100" alt="Photo of Debbie K." src="http://s3-media2.px.yelpcdn.com/photo/xZab5rpdueTCJJuUiBlauA/ms.jpg">
</a>
</div>

I am using the following XPath to find it:

HtmlNodeCollection bodyNode = htmlDoc.DocumentNode.SelectNodes("//div[@class='photoBox pB-ms']");

This works and returns all the div elements with the photoBox class.

But when I try to get the href of the a element using

HtmlNodeCollection bodyNode = htmlDoc.DocumentNode.SelectNodes("//div[@class='photoBox pB-ms'//a href]");

I get an "invalid token" error.

I also tried a LINQ query:

   var lowestreview =
  from main in htmlDoc.DocumentNode.SelectNodes("//div[@class='photoBox pB-ms']") 
   from rating in main.SelectNodes("//a href")
  select new { Main=main.Attributes[0].Value,AHref = rating.ToString() };

Can anybody tell me how to write an XPath expression or a query to get this href?


Answer 1:


This works (tested):

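HtmlNodeCollection bodyNodes = htmlDoc.DocumentNode
                                      .SelectNodes("//div[@class='photoBox pB-ms']/a[@href]");
// SelectNodes returns null (not an empty collection) when nothing matches.
if (bodyNodes != null)
{
    foreach (var node in bodyNodes)
    {
        string href = node.Attributes["href"].Value;
    }
}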

The problem is that you had attribute and element selectors mixed up: an attribute test belongs inside a predicate, written as [@href], not after the element name. Also, from your question it's unclear whether you really intended to query for a collection.

The XPath selector above selects every a element that has an href attribute and is a direct child of a div element whose class attribute is exactly 'photoBox pB-ms'. You can then iterate over this collection and read the href attribute value of each element.

Also, HtmlAgilityPack supports LINQ (since 1.4), so getting a particular attribute value can be done more easily (imo) like this:

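// Requires: using System.Linq;
string hrefValue = htmlDoc.DocumentNode
                          .Descendants("div")
                          // GetAttributeValue avoids a NullReferenceException on divs without a class attribute
                          .Where(x => x.GetAttributeValue("class", "") == "photoBox pB-ms")
                          .Select(x => x.Element("a"))
                          .Where(a => a != null) // skip matching divs that have no direct a child
                          .Select(a => a.GetAttributeValue("href", ""))
                          .FirstOrDefault();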
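
The query-syntax attempt from the question can be made to work as well. What follows is a minimal sketch, not part of the original answer, and it assumes the HtmlAgilityPack package is referenced. The Program class, the html string literal (copied from the question's markup), and the links variable are illustrative names only. Two things differ from the question's query: the attribute test is written as a predicate ([@href]), and the inner XPath starts with "./" so that it is evaluated relative to each div. An XPath that starts with "//" searches from the document root, even when SelectNodes is called on a child node.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Illustrative input: the markup from the question.
        const string html = @"<div class=""photoBox pB-ms"">
<a href=""/user_details?userid=ePDZ9HuMGWR7vs3kLfj3Gg"">
<img width=""100"" height=""100"" alt=""Photo of Debbie K."" src=""http://s3-media2.px.yelpcdn.com/photo/xZab5rpdueTCJJuUiBlauA/ms.jpg"">
</a>
</div>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        var links =
            // SelectNodes returns null when nothing matches, so fall back to an empty sequence.
            from main in (htmlDoc.DocumentNode.SelectNodes("//div[@class='photoBox pB-ms']")
                          ?? Enumerable.Empty<HtmlNode>())
            // "./a[@href]" is relative to the current div.
            from anchor in (main.SelectNodes("./a[@href]")
                            ?? Enumerable.Empty<HtmlNode>())
            select new
            {
                Main = main.GetAttributeValue("class", ""),
                AHref = anchor.GetAttributeValue("href", "")
            };

        foreach (var link in links)
        {
            Console.WriteLine(link.AHref);
        }
    }
}
```

The null fallbacks keep the query from throwing when a page contains no matching div, or when a matching div has no a child.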



Answer 2:


Instead of XML parsing, you can use HtmlAgilityPack:

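HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html); // html: a string holding the HTML text
// Current HtmlAgilityPack versions expose the root as DocumentNode.
// Note that SelectNodes returns null when no node matches.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    // att.Value holds the link target
}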


Source: https://stackoverflow.com/questions/6838947/parse-inner-html
