Web scraping dynamic pages in R

Question

I have changed the example website to make this question clearer. I am still facing the same issue: the rvest package alone does not seem to be enough, and an answer may be easier to obtain with RSelenium. The website is http://ravimaailma.fi/cg/tulokset/20/ and I want to extract the links in the main article list that lead to the individual race results. The links look something like this: http://ravimaailma.fi/article/tulokset/pori-18-11-2017-tulokset/8718/

I tried plain rvest first, thinking that would be all I need here. SelectorGadget gives the CSS selector for the links as .article-title a, so my code is simply:

library(rvest)

url <- "http://ravimaailma.fi/cg/tulokset/20/"

url %>%
  read_html() %>%
  html_nodes(".article-title a") %>%
  html_text()

This returns nothing. The website loads more results as you scroll down, but I thought I would at least get the first results. The code below does return some links, and elements 28:32 look promising, but I think they come from the sidebar, not from the article list.

url %>%
  read_html() %>% 
  html_nodes("a") %>% 
  html_attr("href")

What am I doing wrong here, and can RSelenium help me?
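A note on the two snippets above: html_text() returns the visible link text, while html_attr("href") returns the link target, and filtering hrefs by a URL pattern is less fragile than picking positions such as 28:32. The sketch below shows both on a small hand-written HTML fragment. The markup is an assumption standing in for the rendered page, not the site's actual source; against the live URL the selector may still match nothing if the article list is injected by JavaScript after the page loads.

```r
library(rvest)

# Hypothetical fragment: one article link and one sidebar link.
# The real page structure may differ.
page <- read_html('
<html><body>
  <div class="article-title"><a href="http://ravimaailma.fi/article/tulokset/pori-18-11-2017-tulokset/8718/">Pori 18.11.2017 tulokset</a></div>
  <div class="sidebar"><a href="http://ravimaailma.fi/cg/tulokset/20/">Tulokset</a></div>
</body></html>')

links <- page %>% html_nodes(".article-title a")

titles <- html_text(links)          # visible link text
hrefs  <- html_attr(links, "href")  # link targets

# Alternative: take every anchor and keep only hrefs that match the
# pattern of a race-result URL, instead of selecting by position.
all_hrefs  <- page %>% html_nodes("a") %>% html_attr("href")
race_links <- grep("/article/tulokset/", all_hrefs, value = TRUE)
```

If this works on a saved copy of the rendered page but not on the URL itself, that points to JavaScript-loaded content rather than a wrong selector.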

Answer 1:

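Here is my partial answer. It does not get everything yet, but maybe it helps someone. The code returns one link, for the first result. That is most likely because findElement() returns only the first element matching the selector, whereas findElements() returns all of them. I'm using: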

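library(RSelenium)

# Start a Selenium server and open a Chrome session
rD <- rsDriver(port = 4444L, browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate("http://ravimaailma.fi/cg/tulokset/20/")

# findElement() returns only the first element that matches the selector
elem <- remDr$findElement(using = "css selector", value = ".article-title a")
elemtxt <- elem$getElementAttribute("href")

# Click the button to load more results
# button <- remDr$findElement(using = "id", value = "loadmore")
# button$clickElement()

# Close the browser and stop the Selenium server
remDr$close()
rD$server$stop()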

I haven't used the button click yet, but it seemed to work as well. The only remaining problem is that I can't get all the results from the site.
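A minimal sketch of how the remaining pieces might fit together, assuming remDr is an RSelenium client created with rsDriver() as above and that the "load more" button really has the id loadmore (taken from the commented-out lines, not verified against the site). The helper names collect_links and load_more, and the max_clicks and pause parameters, are hypothetical and only for illustration.

```r
# `remDr` is expected to be an RSelenium remoteDriver client,
# e.g. rD[["client"]] from rsDriver().

# Return the unique href of every element matching `selector`.
# findElements() (plural) returns a list of all matching elements.
collect_links <- function(remDr, selector = ".article-title a") {
  elems <- remDr$findElements(using = "css selector", value = selector)
  unique(unlist(lapply(elems, function(e) e$getElementAttribute("href"))))
}

# Click the "load more" button up to `max_clicks` times, waiting `pause`
# seconds after each click. Stops early if the button cannot be found.
# The id "loadmore" is an assumption.
load_more <- function(remDr, button_id = "loadmore", max_clicks = 5, pause = 2) {
  for (i in seq_len(max_clicks)) {
    button <- tryCatch(
      remDr$findElement(using = "id", value = button_id),
      error = function(e) NULL
    )
    if (is.null(button)) break
    button$clickElement()
    Sys.sleep(pause)
  }
  invisible(remDr)
}
```

With a live session this would be called as load_more(remDr) followed by links <- collect_links(remDr), before remDr$close(). A fixed pause is the simplest option; the right waiting time depends on how quickly the site loads new results.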

Answer 2:

[I'm not (yet) allowed to write comments, so I chose to post this as an answer.] RSelenium is not always necessary; you can also interact with a website directly using PhantomJS (see e.g. this example).

If you provide an example from the website instead of a local link to a .pdf, I can try to find out how to retrieve the data.

Source: https://stackoverflow.com/questions/45585575/webscraping-dynamic-pages-in-r

Tags

scrape