Question
[This question has a relative that lives at: Selective screen scraping with HTMLAgilityPack and XPath ]
I have some HTML to parse that looks roughly like this:
...
<tr>
<td><a href="" title="">Text Data here (1)</a></td>
<td>Text Data here(2)</td>
<td>Text Data here(3)</td>
<td>Text Data here(4)</td>
<td>Text Data here(5)</td>
<td>Text Data here(6)</td>
<td><a href="link here {1}" class="image"><img alt="" src="" /></a></td>
</tr>
<tr>
<td><a href="" title="">Text Data here (1)</a></td>
<td>Text Data here(2)</td>
<td>Text Data here(3)</td>
<td>Text Data here(4)</td>
<td>Text Data here(5)</td>
<td>Text Data here(6)</td>
<td><a href="link here {1}" class="image"><img alt="" src="" /></a></td>
</tr>
...
I am looking for a way to parse it into meaningful chunks like this:
(1), (2), (3), (4), (5), (6), {1}CRLF
(1), (2), (3), (4), (5), (6), {1}CRLF
and so on
I have tried two ways:
way 1:
var dataList = currentDoc.DocumentNode.Descendants("tr")
    .Select(tr => tr.Descendants("td").Select(td => td.InnerText).ToList())
    .ToList();
which fetches the inner text of the td elements, but fails to fetch the link {1}. Here, a list of lists is created, which I can handle with a nested foreach.
way 2:
var dataList = currentDoc.DocumentNode
.SelectNodes("//tr//td//text()|//tr//td//a//@href");
which does get me the link {1} and all the data, but the result is unorganized: everything comes back as one big flat chunk. Since the data within one tr is related, I lose that relation.
So, how can I solve this problem?
Answer 1:
The following query selects an a element with a non-empty href attribute from each cell. If there is no such element, the inner text of the cell is used:
var dataList =
    currentDoc.DocumentNode.Descendants("tr")
        .Select(tr => from td in tr.Descendants("td")
                      let a = td.SelectSingleNode("a[@href!='']")
                      select a == null ? td.InnerText
                                       : a.Attributes["href"].Value);
Feel free to add ToList() calls if you need materialized lists.
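To get from that query to the "one line per row" output asked for in the question, something like the following might work. This is only a sketch: the sample HTML, the example.com link, the Trim() call and the string.Join formatting are my assumptions, not part of the original question or answer.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical sample input shaped like the table in the question.
        const string html = @"<table>
<tr>
<td><a href="""" title="""">A1</a></td>
<td>A2</td>
<td><a href=""http://example.com/a"" class=""image""><img alt="""" src="""" /></a></td>
</tr>
<tr>
<td><a href="""" title="""">B1</a></td>
<td>B2</td>
<td><a href=""http://example.com/b"" class=""image""><img alt="""" src="""" /></a></td>
</tr>
</table>";

        var currentDoc = new HtmlDocument();
        currentDoc.LoadHtml(html);

        // One inner list per <tr>: for each cell, the href of a direct child
        // <a> with a non-empty href if there is one, otherwise the cell text.
        var dataList = currentDoc.DocumentNode.Descendants("tr")
            .Select(tr => (from td in tr.Descendants("td")
                           let a = td.SelectSingleNode("a[@href!='']")
                           select a == null
                               ? td.InnerText.Trim()   // Trim() is an addition, not in the answer
                               : a.Attributes["href"].Value).ToList())
            .ToList();

        // Each row keeps its cells together, so the relation within a tr is preserved.
        foreach (var row in dataList)
            Console.WriteLine(string.Join(", ", row));
    }
}
```

Note that the first cell's link has an empty href, so it does not match a[@href!=''] and its text is used instead, while the last cell contributes its link.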
Source: https://stackoverflow.com/questions/15404670/screen-scraping-with-htmlagilitypack-and-xpath