Question
Hi, I am trying to parse a webpage with HTML Agility Pack to get the src of an image. This is the structure of the page:
<div class="post_body">
<div style="text-align: center;">
<a href="http://www.engadget.com/2012/02/29/qualcomm-windows-8/">
<img src="http://www.blogcdn.com/www.engadget.com/media/2012/02/201202297192-1330536971.jpg" style="border-width: 0px; border-style: solid; margin: 4px;">
</a>
</div>
<div>
Now I am using this code to attempt to get the src:
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://www.engadget.com/2012/02/29/qualcomm-windows-8");
HtmlNode baseNode = doc.DocumentNode.SelectSingleNode("//div[@class='post_content permalink ']");
string Description = baseNode.SelectSingleNode("//div[@class='post_body']").InnerText.Replace("\n", "").Replace("\r", "").Trim();
string href = baseNode.SelectSingleNode("//div[@class='post_body']//img[@src]").InnerText;
However, the string is always returned as null.
Any ideas? Maybe I have a bad XPath expression?
Answer 1:
Any ideas? Maybe I have a bad XPath expression?
Yes, there are a few problems:
//div[@class='post_content permalink ']
This selects nothing, because the provided document has no div with a class attribute whose value is 'post_content permalink '.
SelectSingleNode("//div[@class='post_body']//img[@src]").InnerText;
The img element, even if one is found, has no children, and thus no InnerText.
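Both problems can be checked against a trimmed copy of the markup from the question. The following is only a sketch: the class name XPathDiagnostics and its Matches and InnerTextOf helpers are hypothetical names introduced for illustration, and SampleHtml is a shortened version of the posted HTML, not the live page.

```csharp
using HtmlAgilityPack;

public static class XPathDiagnostics
{
    // Shortened copy of the markup from the question (assumption: the live page may differ).
    public const string SampleHtml =
        "<div class=\"post_body\">" +
        "<div style=\"text-align: center;\">" +
        "<a href=\"http://www.engadget.com/2012/02/29/qualcomm-windows-8/\">" +
        "<img src=\"http://www.blogcdn.com/www.engadget.com/media/2012/02/201202297192-1330536971.jpg\">" +
        "</a></div></div>";

    // Returns true when the XPath expression selects at least one node.
    public static bool Matches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectSingleNode returns null when nothing matches.
        return doc.DocumentNode.SelectSingleNode(xpath) != null;
    }

    // Returns the InnerText of the first matching node, or null when nothing matches.
    public static string InnerTextOf(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }
}
```

With this sample, Matches(SampleHtml, "//div[@class='post_content permalink ']") should be false, because @class='...' is an exact string comparison, and InnerTextOf(SampleHtml, "//div[@class='post_body']//img[@src]") should be an empty string, because the img element has no child text.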
Solution:
You want something like this:
HtmlNode img = doc.DocumentNode.SelectSingleNode("//div[@class='post_body']//img[@src]");
string srcUrl = img.Attributes["src"].Value;
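Since SelectSingleNode returns null when nothing matches, it may be worth guarding the attribute lookup. Here is a minimal sketch of that idea; ImageSrcExtractor and GetFirstImageSrc are hypothetical names, not part of HTML Agility Pack.

```csharp
using HtmlAgilityPack;

public static class ImageSrcExtractor
{
    // Hypothetical helper: returns the src of the first <img> inside
    // div.post_body, or null when the document has no such element.
    public static string GetFirstImageSrc(HtmlDocument doc)
    {
        HtmlNode img = doc.DocumentNode.SelectSingleNode(
            "//div[@class='post_body']//img[@src]");

        // Guard against a non-matching XPath instead of dereferencing null.
        if (img == null)
        {
            return null;
        }

        // The [@src] predicate means the attribute is present on any match;
        // the empty-string default is only a fallback.
        return img.GetAttributeValue("src", string.Empty);
    }
}
```

It could be called as string srcUrl = ImageSrcExtractor.GetFirstImageSrc(doc); where doc is the HtmlDocument returned by hw.Load(...) in the question, with a null check on the result before using it.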
Source: https://stackoverflow.com/questions/9506588/parse-image-src-with-html-agilty-pack