Question
I have edited the question for brevity and clarity.
My goal is to find an XPath expression that will result in "test1"..."test8" listed separately.
I am working with xpathApply to extract text from web pages. Because the pages I pull information from have different layouts, I need to extract the XML values from all <font> and <p> html tags. The problem I run into is when one type is nested within the other, resulting in partial duplicates when I use the following xpathApply expression with an "or" condition.
library(XML)
html <-
'<!DOCTYPE html>
<html lang="en">
<body>
<p>test1</p>
<font>test2</font>
<p><font>test3</font></p>
<font><p>test4</p></font>
<p>test5<font>test6</font></p>
<font>test7<p>test8</p></font>
</body>
</html>'
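# Parse into an internal document so that XPath queries can be run on it
work <- htmlTreeParse(html, useInternalNodes = TRUE, encoding = 'UTF-8')
# Select every <p> and every <font>; nested tags yield overlapping text
table <- xpathApply(work, "//p|//font", xmlValue)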
table
It should be easy to see the type of issue that comes with the nesting: because <font> and <p> tags are sometimes nested and sometimes not, I can't ignore either of them, but searching for both gives me partial duplicates. For other reasons, I prefer the text pieces to be broken up rather than aggregated (that is, taken from the lowest level/furthest nested tag).
The reason I am not just doing two separate searches and then appending them after removing duplicate strings is that I need to preserve the ordering of text as it appears in the html.
Thanks for reading!
Answer 1:
Okay, I figured it out (entirely due to this post here: http://www.r-bloggers.com/htmltotext-extracting-text-from-html-via-xpath/).
The answer for me was to just take any text within the html and clean out some stuff not needed, like this:
table <- xpathApply(work, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
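
On pages with more markup, an expression like the one above can also return text nodes that hold nothing but whitespace. The following is a minimal sketch of one way to drop them, not part of the original answer: the sample page and the names doc, xp and txt are made up for illustration, and the extra [normalize-space()] predicate and the trimws() call are additions.

```r
library(XML)

# Hypothetical stand-in page; the <script> content should be skipped.
html <- '<html><body>
<p>test1</p>
<script>var x = 1;</script>
<p>test5<font>test6</font></p>
</body></html>'

doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Same ancestor filters as in the answer, plus [normalize-space()],
# which keeps only text nodes that contain non-whitespace characters.
xp <- paste0(
  "//text()[not(ancestor::script)][not(ancestor::style)]",
  "[not(ancestor::noscript)][normalize-space()]"
)

# trimws() strips leading/trailing whitespace from each remaining piece.
txt <- trimws(xpathSApply(doc, xp, xmlValue))
print(txt)
```

Because the nodes come back in document order, the ordering requirement from the question is preserved.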
Answer 2:
Looks like this might work:
xpathSApply(work, "//body//node()[//p|//font]//text()", xmlValue)
# [1] "test1" "test2" "test3" "test4" "test5" "test6" "test7" "test8"
Just switch to xpathApply for the list result. We could also use getNodeSet:
getNodeSet(work, "//body//node()[//p|//font]//text()", fun = xmlValue)
# [[1]]
# [1] "test1"
#
# [[2]]
# [1] "test2"
#
# [[3]]
# [1] "test3"
#
# [[4]]
# [1] "test4"
#
# [[5]]
# [1] "test5"
#
# [[6]]
# [1] "test6"
#
# [[7]]
# [1] "test7"
#
# [[8]]
# [1] "test8"
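
One caveat about the expression above: the predicate [//p|//font] is an absolute path, so it is true for every node as long as the document contains any <p> or <font> at all. It does not limit the match to text inside those tags. A sketch of a stricter variant, assuming the goal is "every text node that sits somewhere inside a <p> or <font>", tests the ancestors of each text node instead. The names xp and pieces are illustrative only, and the [normalize-space()] predicate is an addition to skip whitespace-only nodes.

```r
library(XML)

html <- '<!DOCTYPE html>
<html lang="en">
<body>
<p>test1</p>
<font>test2</font>
<p><font>test3</font></p>
<font><p>test4</p></font>
<p>test5<font>test6</font></p>
<font>test7<p>test8</p></font>
</body>
</html>'

work <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Keep a text node only if it has a <p> or <font> ancestor and is
# not whitespace-only. Results are returned in document order.
xp <- "//body//text()[ancestor::p or ancestor::font][normalize-space()]"
pieces <- xpathSApply(work, xp, xmlValue)
print(pieces)
```

Since each text node is selected exactly once, nesting <p> inside <font> (or the reverse) cannot produce duplicates, and text placed directly under <body> is left out.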
Source: https://stackoverflow.com/questions/27244084/r-and-xpathapply-removing-duplicates-from-nested-html-tags