Question
When I use XPath 1.0's substring-before or substring-after in an expression, my subsequent xmlValue call throws an error. The code below shows that the XPath expression works fine with httr, but not with RCurl.
require(XML)
require(httr)
doc <- htmlTreeParse("http://www.cottonbledsoe.com/CM/Custom/TOCContactUs.asp", useInternal = TRUE)
(string <- xpathSApply(doc, "substring-before(//div[@id = 'contactInformation']//p, 'Phone')", xmlValue, trim = TRUE))
require(RCurl)
fetch <- GET("http://www.cottonbledsoe.com/CM/Custom/TOCContactUs.asp")
contents <- content(fetch)
locsnodes <- getNodeSet(contents, "//div[@id = 'contactInformation']//p")
sapply(locsnodes, xmlValue)
[1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n Phone: 432-897-1440\r\n Toll Free: 866-721-6665\r\n Fax: 432-682-3672"
The code above works OK, but I want to use substring-before to clean up the result like this:
[1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n "
locsnodes <- getNodeSet(contents, "substring-before(//div[@id = 'contactInformation']//p, 'Phone')")
sapply(locsnodes, xmlValue)
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "character"
How can I use substring-before and also RCurl, since RCurl is the chosen package for a more complicated operation used later?
Thank you for any guidance (or a better way to achieve what I want).
Answer 1:
The fun argument in xpathSApply (or indeed getNodeSet) is only called if a node set is returned. In your case a character string is returned, so the function is ignored:
require(XML)
require(RCurl)
doc <- htmlParse("http://www.cottonbledsoe.com/CM/Custom/TOCContactUs.asp")
locsnodes <- getNodeSet(doc, "substring-before(//div[@id = 'contactInformation']//p, 'Phone')")
> locsnodes
[1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n "
> str(locsnodes)
chr "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n "
The fun argument is likewise not used here in xpathSApply
> xpathSApply(doc, "substring-before(//div[@id = 'contactInformation']//p, 'Phone')"
+ , function(x){1}
+ )
[1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n "
as your XPath expression is not returning a node set.
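As a minimal sketch of one way to work around this, you could apply xmlValue only when the result really is a node set. The helper name xpath_text and the inline HTML below are hypothetical stand-ins, not part of the XML package or the real page; with RCurl you would presumably fetch the page text with getURL and pass it to htmlParse in the same way.

```r
library(XML)

# Hypothetical inline markup standing in for the fetched page.
# With RCurl, the text could come from RCurl::getURL(url) instead.
page_text <- paste0(
  "<html><body><div id='contactInformation'>",
  "<p>500 West Illinois, Suite 300 Midland, Texas 79701 Phone: 432-897-1440</p>",
  "</div></body></html>"
)
doc <- htmlParse(page_text, asText = TRUE)

# Hypothetical helper: call xmlValue only if the XPath returned nodes.
# String-valued XPath functions such as substring-before come back as a
# plain character vector, so they are returned unchanged.
xpath_text <- function(doc, path) {
  res <- getNodeSet(doc, path)
  if (is.list(res)) sapply(res, xmlValue) else res
}

xp <- "substring-before(//div[@id = 'contactInformation']//p, 'Phone')"
address <- trimws(xpath_text(doc, xp))
full    <- xpath_text(doc, "//div[@id = 'contactInformation']//p")
```

The same helper then works for both kinds of expression, so the later sapply(locsnodes, xmlValue) step is no longer needed for string results.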
Answer 2:
Here's a slightly different approach using the rvest package. I think you're generally better off doing string manipulation in R rather than in XPath:
library(rvest)
# read_html() replaces html(), which is deprecated in newer rvest releases
contact <- read_html("http://www.cottonbledsoe.com/CM/Custom/TOCContactUs.asp")
contact %>%
html_node("#contactInformation p") %>%
html_text() %>%
gsub(" Phone.*", "", .)
#> [1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n"
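The same clean-up can be sketched in base R without any network access, assuming the node text has already been extracted into a string. The variable names are illustrative only, and the sample text is the value shown in the question. Note one difference from XPath: substring-before returns an empty string when the marker is missing, whereas this split returns the whole input.

```r
# Text of the <p> node, copied from the question's output.
txt <- "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n Phone: 432-897-1440\r\n Toll Free: 866-721-6665\r\n Fax: 432-682-3672"

# Keep everything before the first literal "Phone" and trim whitespace.
address <- trimws(strsplit(txt, "Phone", fixed = TRUE)[[1]][1])

# Optionally split the address into its individual lines.
address_lines <- trimws(strsplit(address, "\r\n", fixed = TRUE)[[1]])
```

Using fixed = TRUE treats "Phone" as a literal string rather than a regular expression, which keeps the behaviour close to what substring-before does.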
Source: https://stackoverflow.com/questions/26202615/why-different-results-with-xpath-1-0-and-rcurl-vs-httr-using-substring-before-a