R: Scrape a list of Google+ URLs using the purrr package


Question


I am working on a web scraping project that aims to extract Google+ reviews for a set of children's hospitals. My methodology is as follows:

1) Define a list of Google+ URLs to navigate to for review scraping. The URLs are in a dataframe along with other variables describing the hospital.

2) Scrape the reviews, number of stars, and post time for all reviews associated with a given URL.

3) Save these elements in a dataframe, and name that dataframe after another variable in the URL dataframe that corresponds to the hospital.

4) Move on to the next URL, and so on until all URLs are scraped.

Currently, the code is able to scrape a single URL. I have tried to wrap it in a function and apply it to every URL with map from the purrr package; however, it doesn't seem to be working, so I am doing something wrong.
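For context, the general shape I am aiming for is something like the sketch below: one helper that scrapes a single URL and returns a dataframe, applied to every URL with map. (This is only an illustration; scrape_one, some_urls and results are placeholder names, not my real code.)

#Placeholder for the real scraper: takes one URL and returns one dataframe of reviews
scrape_one = function(url) {
  data.frame(review = character(), rating = character(), time = character())
}

#map() applies the scraper to each URL in turn and collects the results in a list
some_urls = c("https://example.com/hospital-a", "https://example.com/hospital-b")
results = purrr::map(some_urls, scrape_one)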

Here is my actual attempt, with comments on the purpose of each step:

#Load the necessary libraries
devtools::install_github("ropensci/RSelenium")
library(purrr)
library(dplyr)
library(stringr)
library(rvest)
library(xml2)
library(RSelenium)
#To avoid any SSL error messages
library(httr)
set_config(config(ssl_verifypeer = 0L))

Defining the URL dataframe

#Now to define the dataframe with the urls
urls_df =data.frame(Name=c("CHKD","AIDHC")
                    ,ID=c("AAWZ12","AAWZ13")
                    ,GooglePlus_URL=c("https://www.google.co.uk/search?ei=fJUKW9DcJuqSgAbPsZ3gDQ&q=Childrens+Hospital+of+the+Kings+Daughter+&oq=Childrens+Hospital+of+the+Kings+Daughter+&gs_l=psy-ab.3..0i13k1j0i22i10i30k1j0i22i30k1l7.8445.8445.0.9118.1.1.0.0.0.0.144.144.0j1.1.0....0...1c.1.64.psy-ab..0.1.143....0.qDMr7IDA-uA#lrd=0x89ba9869b87f1a69:0x384861b1e3a4efd3,1,,,",
                                      "https://www.google.co.uk/search?q=Alfred+I+DuPont+Hospital+for+Children&oq=Alfred+I+DuPont+Hospital+for+Children&aqs=chrome..69i57.341j0j8&sourceid=chrome&ie=UTF-8#lrd=0x89c6fce9425c92bd:0x80e502f2175fb19c,1,,,"
                                      ))
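One caveat I was unsure about (not in my original script, just a defensive step I considered): before R 4.0, data.frame() converts character columns to factors by default, and I don't know whether navigate() is happy with a factor, so the columns can be coerced back to character first.

#Keep the URL and Name columns as plain character vectors rather than factors
#(data.frame() defaults to stringsAsFactors = TRUE before R 4.0)
urls_df$GooglePlus_URL = as.character(urls_df$GooglePlus_URL)
urls_df$Name = as.character(urls_df$Name)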

Creating the function

extract_google_review=function(googleplus_urls) {

  #Opens a Chrome session
  rmDr=rsDriver(browser = "chrome",check = F)
  myclient= rmDr$client

  #Creates a sub-dataframe for the filtered hospital, which I will later use to name the dataframe
  urls_df_sub=urls_df %>% filter(GooglePlus_URL %in% googleplus_urls)

  #Navigate to the url
  myclient$navigate(googleplus_urls)

  #click on the snippet to switch focus----------
  webEle <- myclient$findElement(using = "css",value = ".review-snippet")
  webEle$clickElement()
  # Save page source
  pagesource= myclient$getPageSource()[[1]]

  #simulate scrolling down several times-------------
  count=read_html(pagesource) %>%
    html_nodes(".p13zmc") %>%
    html_text()

  #Stores the number of reviews for the url, so we know how many times to scroll down
  scroll_down_times=count %>%
    str_sub(1,nchar(count)-5) %>%
    as.numeric()

  for(i in 1:scroll_down_times){
    webEle$sendKeysToActiveElement(sendKeys = list(key="page_down"))
    #the content needs time to load, so wait 1.2 seconds after every 5 scroll-downs
    if(i%%5==0){
      Sys.sleep(1.2)
    }
  }

  #loop and simulate clicking on all "click on more" elements-------------
  webEles <- myclient$findElements(using = "css",value = ".review-more-link")
  for(webEle in webEles){
    tryCatch(webEle$clickElement(),error=function(e){print(e)})
  }

  pagesource= myclient$getPageSource()[[1]]
  #this should get the full review, including translation and original text
  reviews=read_html(pagesource) %>%
    html_nodes(".review-full-text") %>%
    html_text()

  #number of stars
  stars <- read_html(pagesource) %>%
    html_node(".review-dialog-list") %>%
    html_nodes("g-review-stars > span") %>%
    html_attr("aria-label")

  #time posted
  post_time <- read_html(pagesource) %>%
    html_node(".review-dialog-list") %>%
    html_nodes(".dehysf") %>%
    html_text()

  #Consolidating everything into a dataframe, truncating all three vectors to the
  #shortest length so the columns line up
  n_rows=min(length(reviews),length(stars),length(post_time))
  reviews=head(reviews,n_rows)
  stars=head(stars,n_rows)
  post_time=head(post_time,n_rows)
  reviews_df=data.frame(review=reviews,rating=stars,time=post_time)

  #Assign the dataframe a name based on the value in column 'Name' of the dataframe urls_df, defined above
  df_name <- tolower(urls_df_sub$Name)

  if(exists(df_name)) {
    assign(df_name, unique(rbind(get(df_name), reviews_df)))
  } else {
    assign(df_name, reviews_df)
  }


} #End function
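One more thing I was unsure about: the function starts a fresh rsDriver() session on every call but never shuts it down, so mapping over several URLs may leave browser sessions piling up (and, I believe, a second rsDriver() call can fail if the previous server is still holding its port). The tidy-up I had in mind, placed just before the closing brace, looks like this (again, not in my original code):

#Tidy up at the end of each call: close the browser window and stop the
#Selenium server that rsDriver() started
myclient$close()
rmDr$server$stop()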

Feeding the URLs into the function

#Now that the function is defined, it is time to create a vector of URLs and feed this vector into the function
googleplus_urls=urls_df$GooglePlus_URL
googleplus_urls %>% map(extract_google_review)
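As I understand it, map() itself returns a list, so another option I considered is keeping everything together in one named list instead of separate dataframes. This sketch assumes extract_google_review() simply ends by returning reviews_df rather than calling assign().

#Alternative: collect one dataframe per hospital in a named list
#(assumes the function returns reviews_df as its last expression)
review_tables = googleplus_urls %>%
  map(extract_google_review) %>%
  set_names(urls_df$Name)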

There seems to be an error in the function which is preventing it from scraping and storing the data in separate dataframes as intended.

My Intended Output

Two dataframes (one per hospital), each with 3 columns: review, rating, and time.
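To spell that out: after running the map call I would expect the objects below to exist in my workspace (the names come from the lower-cased Name column), but at the moment they are not created.

#Expected after the map call: one dataframe per hospital, named after it
#  chkd  - data.frame with columns review, rating, time
#  aidhc - data.frame with columns review, rating, time
exists("chkd")   #currently returns FALSE for me
exists("aidhc")  #currently returns FALSE for me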

Any pointers on how this can be improved will be greatly appreciated.

Source: https://stackoverflow.com/questions/50680985/r-scrape-a-list-of-google-urls-using-purrr-package
