Identify the correct CSS selector of a URL for an R script

Question

I am trying to obtain data from a website and, thanks to a helper, I arrived at the following script:

library(httr)
library(rvest)

# Submit the advanced search form for one genus/species pair
res <- httr::POST(url = "http://apps.kew.org/wcsp/advsearch.do",
                  body = list(page = "advancedSearch",
                              AttachmentExist = "",
                              family = "",
                              placeOfPub = "",
                              genus = "Arctodupontia",
                              yearPublished = "",
                              species = "scleroclada",
                              author = "",
                              infraRank = "",
                              infraEpithet = "",
                              selectedLevel = "cont"),
                  encode = "form")

# Parse the response as HTML, then try to read the link from the first "td"
pg <- content(res, as = "parsed")
lnks <- html_attr(html_node(pg, "td"), "href")

However, in some cases, like the example above, it does not retrieve the right link because, for some reason, html_attr does not find URLs ("href") within the node detected by html_node. So far, I have tried different CSS selectors, such as "td", "a.onwardnav" and ".plantname", but none of them produces an object that html_attr can handle correctly. Any hints?

Answer 1:

You are very close to the answer you were expecting. If you would like to pull the links off the desired page, use:

lnks <- html_attr(html_nodes(pg,"a"), "href")

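This returns a character vector with the "href" value of every "a" tag on the page. Note that the function is html_nodes, not html_node: there are multiple "a" tags, hence the plural. html_node returns only the first match, and html_attr gives NA when the selected element itself carries no "href" attribute, which is the case for a "td".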
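The difference can be checked offline on a small page. The following is a minimal sketch, assuming a made-up HTML fragment in place of the real search result: the link targets are hypothetical, and only the "onwardnav" class name is taken from the question.

```r
library(rvest)

# Hypothetical stand-in for the result page; the href values are invented.
demo_pg <- read_html('
<html><body>
  <table>
    <tr><td>Name</td>
        <td><a class="onwardnav" href="detail.html?id=1">Arctodupontia scleroclada</a></td></tr>
    <tr><td>Other</td>
        <td><a href="home.html">Home</a></td></tr>
  </table>
</body></html>')

# html_node() keeps only the first <td>, which has no href of its own,
# so html_attr() yields NA.
first_td_href <- html_attr(html_node(demo_pg, "td"), "href")

# html_nodes() keeps every <a>; html_attr() then returns one href per anchor.
all_links <- html_attr(html_nodes(demo_pg, "a"), "href")

# A narrower selector: only anchors with class "onwardnav" inside a table cell.
nav_links <- html_attr(html_nodes(demo_pg, "td a.onwardnav"), "href")
```

On the real page, the narrower selector would only help if the result links actually carry that class, which should be confirmed by inspecting the returned HTML.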
If you are looking for the information from the table in the body of the page, try this:

html_table(pg, fill = TRUE)
# or this
html_nodes(pg, "tr")

The second line returns the 9 rows of the table, which can then be parsed to obtain the row names ("th") and/or row values ("td").
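Parsing the rows might look like the sketch below. It assumes each row holds one "th" and one "td", which may not match the real page; the table here is a made-up two-row example built from the search terms in the question.

```r
library(rvest)

# Hypothetical table with one header cell and one value cell per row.
demo_tbl <- read_html('
<html><body>
  <table>
    <tr><th>Genus</th><td>Arctodupontia</td></tr>
    <tr><th>Species</th><td>scleroclada</td></tr>
  </table>
</body></html>')

rows <- html_nodes(demo_tbl, "tr")

# html_node() applied to a node set returns one match per row,
# so names and values stay aligned.
row_names  <- html_text(html_node(rows, "th"), trim = TRUE)
row_values <- html_text(html_node(rows, "td"), trim = TRUE)

# Combine into a named character vector for lookup by row name.
record <- setNames(row_values, row_names)
```

With this layout, record[["Genus"]] gives the value cell of the row whose header is "Genus".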
Hope this helps.

Source: https://stackoverflow.com/questions/36727606/identify-the-correct-css-selector-of-a-url-for-an-r-script

Tags

css

web-scraping

rvest

httr