I'm scraping this website and get the error message "Tibble columns must have compatible sizes".
What should I do in this case?
library(rvest)
library(tidy
The problem is that not every restaurant has a complete record. In this example, the 13th item on the list did not include a price, so the price vector had 14 items while the place vector had 15.
One way to solve this problem is to find the common parent node and then parse those nodes with the html_node() function. html_node() will always return a value, even if it is NA.
library(rvest)
library(dplyr)
library(tibble)
url <- "https://www.zomato.com/tr/toronto/drinks-and-nightlife?page=5"
readpage <- function(url){
  #read the page once
  page <- read_html(url)
  #parse out the parent nodes
  results <- page %>% html_nodes("article.search-result")
  #retrieve the place and price from each parent
  place <- results %>% html_node("a.result-title.hover_feedback.zred.bold.ln24.fontsize0") %>%
    html_attr("title")
  price <- results %>% html_node("div.res-cost.clearfix span.col-s-11.col-m-12.pl0") %>%
    html_text()
  #return a tibble/data.frame
  tibble(url, place, price)
}
readpage(url)
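To see why parsing from the parent nodes keeps the columns aligned, here is a minimal offline sketch. The HTML snippet and the class names (`r`, `t`, `p`) are made up for illustration and are not taken from the real site; the second record deliberately has no price.

```r
library(rvest)
library(tibble)

# Hypothetical markup: three records, the second one is missing its price
page <- read_html('<html><body>
  <div class="r"><a class="t" title="A"></a><span class="p">$10</span></div>
  <div class="r"><a class="t" title="B"></a></div>
  <div class="r"><a class="t" title="C"></a><span class="p">$30</span></div>
</body></html>')

# Searching the whole page silently skips the missing price: only 2 values
prices_flat <- page %>% html_nodes("span.p") %>% html_text()

# Searching within each parent yields exactly one value per parent,
# with NA where the child node is absent: 3 values
parents <- page %>% html_nodes("div.r")
place <- parents %>% html_node("a.t") %>% html_attr("title")
prices_aligned <- parents %>% html_node("span.p") %>% html_text()

# Both columns now have the same length, so tibble() no longer complains
aligned <- tibble(place, price = prices_aligned)
```

Here `prices_flat` has two elements while `place` has three, which is the same mismatch that triggers the error in the question; `prices_aligned` has three elements with `NA` in the second position.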
Also note that in your code example above, you were reading the same page multiple times. This is slow and puts additional load on the server; it could even be viewed as a "denial of service" attack.
It is best to read the page once into memory and then work with that copy.
Update
To answer your question concerning multiple pages: wrap the above function in an lapply() call and then bind the list of returned data frames (or tibbles) with bind_rows().
dfs <- lapply(listofurls, function(url){ readpage(url)})
finalanswer <- bind_rows(dfs)
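As a rough sketch of how `listofurls` might be built and looped over, the example below assumes the pages differ only in the `page` query parameter (as the URL above suggests). `fake_readpage()` is a hypothetical stand-in for `readpage()` so the sketch runs without network access, and the short pause between requests is an optional courtesy to the server, not something the original code does.

```r
library(dplyr)
library(tibble)

# Assumption: only the "page" query parameter changes between pages
base_url <- "https://www.zomato.com/tr/toronto/drinks-and-nightlife?page="
listofurls <- paste0(base_url, 1:3)

# Hypothetical stand-in for readpage(); returns two made-up rows per page
fake_readpage <- function(url) {
  tibble(url = url, place = c("A", "B"), price = c("$10", NA))
}

dfs <- lapply(listofurls, function(url) {
  Sys.sleep(0.5)       # brief pause between requests
  fake_readpage(url)   # swap in readpage(url) for real scraping
})

# Stack the per-page tibbles into a single data frame
finalanswer <- bind_rows(dfs)
```

Because each tibble carries its own `url` column, every row in `finalanswer` can still be traced back to the page it came from.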