Question
Consider this page:
<n1 class="a">
1
</n1>
<n1 class="b">
<b>bold</b>
2
</n1>
If I first select the second n1 using class="b", I should be excluding the first n1, and indeed this appears true:
library(rvest)
b_nodes = read_html('<n1 class="a">1</n1>
<n1 class="b"><b>bold</b>2</n1>') %>%
html_nodes(xpath = '//n1[@class="b"]')
b_nodes
# {xml_nodeset (1)}
# [1] <n1 class="b"><b>bold</b>2</n1>
However if we now use this "subsetted" page:
b_nodes %>% html_nodes(xpath = '//n1')
# {xml_nodeset (2)}
# [1] <n1 class="a">1</n1>
# [2] <n1 class="b"><b>bold</b>2</n1>
How did the 1 node get "re-discovered"?
Note: I know how to get what I want with two separate xpaths. This is a conceptual question about why the "subsetting" didn't work as expected. My understanding was that b_nodes should have excluded the first node altogether; the b_nodes object shouldn't even know that node exists.
Answer 1:
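In html_nodes(xpath = '//n1'), the leading // is short for /descendant-or-self::node()/. Because the path starts with /, it is evaluated from the document root, so the search covers the whole document rather than only the nodes in b_nodes.
Change it to .//n1: the leading . makes the search start from the current node, i.e. what you selected before.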
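To make the difference concrete, here is a minimal sketch using the same two-node page as the question. The variable names (page, absolute, relative, bold, self_or_below) are illustrative and not from the original post. Note one consequence of the XPath rules: .//n1 only reaches n1 elements below the context node, so on this page it returns an empty set rather than the selected node itself.

```r
library(rvest)

page <- read_html('<n1 class="a">1</n1>
<n1 class="b"><b>bold</b>2</n1>')

b_nodes <- html_nodes(page, xpath = '//n1[@class="b"]')

# Absolute path: evaluated from the document root, so both n1 elements match.
absolute <- html_nodes(b_nodes, xpath = '//n1')
length(absolute)   # 2, as in the question

# Relative path: evaluated from each node in b_nodes. The selected n1 has no
# n1 elements beneath it, so nothing matches.
relative <- html_nodes(b_nodes, xpath = './/n1')
length(relative)   # 0

# A relative search for something that does sit under the selected node.
bold <- html_nodes(b_nodes, xpath = './/b')
html_text(bold)    # "bold"

# To include the context node itself, name the axis explicitly.
self_or_below <- html_nodes(b_nodes, xpath = 'descendant-or-self::n1')
length(self_or_below)
```

So b_nodes is not a trimmed copy of the page: each node still points into the full document, and an absolute path is free to walk back up to the root.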
Answer 2:
I am not sure what you are trying to do, but why not try traversing the nodes with a foreach? I mean:
// PHP (SimpleXML): parse the markup, then sort the n1 nodes by class
$XML = simplexml_load_string('
<n1s>
<n1 class="a">1</n1>
<n1 class="b"><b>bold</b>2</n1></n1s>');
$valueA = '';
$valueB = '';
foreach ($XML->xpath('//n1') as $n1) {
    // Use the node from the current iteration, not $XML->n1
    switch ((string)$n1['class']) {
        case 'a':
            $valueA = $n1;
            break;
        case 'b':
            $valueB = $n1;
            break;
    }
}
I hope this can help you. Regards!
Source: https://stackoverflow.com/questions/42167159/why-does-xpath-find-excluded-nodes-again