Why different results with XPath 1.0 and RCurl vs httr, using substring-before an expression

Posted by 混江龙づ霸主 on 2019-12-25 08:57:48

Question


When I use XPath 1.0's substring-before or substring-after in an expression, my subsequent xmlValue call throws an error. The code below shows that the XPath expression works fine with httr, but then doesn't work with RCurl.

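require(XML)
require(httr)
# Parse the page into an internal document that XPath can query
doc <- htmlTreeParse("http://www.cottonbledsoe.com/CM/Custom/TOCContactUs.asp", useInternalNodes = TRUE)
# substring-before() evaluates to a string, which xpathSApply returns directly
(string <- xpathSApply(doc, "substring-before(//div[@id = 'contactInformation']//p, 'Phone')", xmlValue, trim = TRUE))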


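require(RCurl)
# Note: GET() and content() are httr functions, not RCurl functions
fetch <- GET("http://www.cottonbledsoe.com/CM/Custom/TOCContactUs.asp")
contents <- content(fetch)
# A node-set expression: each matching <p> node is passed to xmlValue
locsnodes <- getNodeSet(contents, "//div[@id = 'contactInformation']//p")
sapply(locsnodes, xmlValue)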

[1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n Phone: 432-897-1440\r\n Toll Free: 866-721-6665\r\n Fax: 432-682-3672"

The code above works OK, but I want to use substring-before to clean up the result like this:

[1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n "

locsnodes <- getNodeSet(contents, "substring-before(//div[@id = 'contactInformation']//p, 'Phone')")  
sapply(locsnodes, xmlValue)

Error in UseMethod("xmlValue") : 
  no applicable method for 'xmlValue' applied to an object of class "character"

How can I use substring-before together with RCurl? RCurl is the package I have chosen for a more complicated operation later on.

Thank you for any guidance (or a better way to achieve what I want).


Answer 1:


The fun argument of xpathSApply (or indeed getNodeSet) is only called if the XPath expression returns a node set. In your case a character string is returned, so the function is ignored:

require(XML)
require(RCurl)
doc <- htmlParse("http://www.cottonbledsoe.com/CM/Custom/TOCContactUs.asp")
locsnodes <- getNodeSet(doc
                        , "substring-before(//div[@id = 'contactInformation']//p, 'Phone')")  
> locsnodes
[1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n "

> str(locsnodes)
 chr "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n "

The fun argument is not used here in xpathSApply either:

> xpathSApply(doc, "substring-before(//div[@id = 'contactInformation']//p, 'Phone')"
+             , function(x){1}
+ )
[1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n "

as your XPath expression is not returning a node set.
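
To make the distinction concrete, here is a minimal sketch. It assumes a small inline HTML string as a stand-in for the real page; the second address and its phone number are invented for illustration. It relies on the XPath 1.0 rule that converting a node set to a string uses only the first node in document order, so substring-before applied to a multi-node path only ever sees the first paragraph. If you need every paragraph, one option is to fetch the node set and trim the text in R:

```r
library(XML)

# Hypothetical inline page standing in for the real site (second <p> is made up)
html <- paste0(
  '<html><body><div id="contactInformation">',
  '<p>500 West Illinois, Suite 300 Midland, Texas 79701 Phone: 432-897-1440</p>',
  '<p>100 Main Street Austin, Texas 78701 Phone: 512-555-0100</p>',
  '</div></body></html>'
)
doc <- htmlParse(html, asText = TRUE)

path <- "//div[@id = 'contactInformation']//p"

# String-valued expression: the result is already a character vector,
# so no xmlValue call is needed. Only the first <p> contributes.
first_only <- getNodeSet(doc, sprintf("substring-before(%s, 'Phone')", path))

# Node-set expression: extract the text of every node, then trim in R.
nodes <- getNodeSet(doc, path)
all_addresses <- sub("[[:space:]]*Phone.*$", "", sapply(nodes, xmlValue))
```

In the question's code, this means the sapply(locsnodes, xmlValue) step can simply be dropped when the expression is a substring-before call, since locsnodes is then already the string you want.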




Answer 2:


Here's a slightly different approach using the rvest package. I think you're generally better off doing string manipulation in R rather than in XPath.

library(rvest)

contact <- read_html("http://www.cottonbledsoe.com/CM/Custom/TOCContactUs.asp")

contact %>%
  html_node("#contactInformation p") %>%
  html_text() %>%
  gsub(" Phone.*", "", .)
#> [1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n"
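
As a rough illustration of why string handling in R can be more flexible, the sketch below starts from the full paragraph text shown in the question and splits it into fields using base R only. The variable names are placeholders, and the approach assumes the lines are separated by "\r\n" as in that output:

```r
# Paragraph text as shown in the question's output
txt <- "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n Phone: 432-897-1440\r\n Toll Free: 866-721-6665\r\n Fax: 432-682-3672"

# Split on the line breaks and trim surrounding whitespace
parts <- trimws(strsplit(txt, "\r\n", fixed = TRUE)[[1]])

# Everything before the first "Phone" line is treated as the address
phone_idx <- grep("^Phone", parts)[1]
address <- paste(parts[seq_len(phone_idx - 1)], collapse = ", ")

# Strip the label to keep only the number
phone <- sub("^Phone:[[:space:]]*", "", parts[phone_idx])
```

The same pattern extends to the toll-free and fax lines, which would be awkward to pull apart with XPath 1.0 string functions alone.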


Source: https://stackoverflow.com/questions/26202615/why-different-results-with-xpath-1-0-and-rcurl-vs-httr-using-substring-before-a
