How to get rid of the error while scraping web in R?

谎友^ 2021-01-28 21:21

I'm scraping this website and get the error message "Tibble columns must have compatible sizes."
What should I do in this case?

library(rvest)
library(tidy         


        
1 Answer
  •  孤独总比滥情好
    2021-01-28 22:13

    The problem is that not every restaurant has a complete record. In this example the 13th item on the list did not include a price, so the price vector had 14 items while the place vector had 15 items, and tibble() cannot combine columns of different lengths.

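    One way to solve this problem is to find the common parent node of each record and then parse those parent nodes with the html_node() function. Applied to a set of parent nodes, html_node() always returns exactly one value per parent, and that value is NA when the child is missing, so all columns end up the same length.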

    library(rvest)
    library(dplyr)
    library(tibble)
    
    
    url <- "https://www.zomato.com/tr/toronto/drinks-and-nightlife?page=5"
    readpage <- function(url){
       #read the page once
       page <-read_html(url)
    
       #parse out the parent nodes
       results <- page %>% html_nodes("article.search-result")
    
       #retrieve the place and price from each parent
       place <- results %>% html_node("a.result-title.hover_feedback.zred.bold.ln24.fontsize0") %>%
          html_attr("title")
       price <- results %>% html_node("div.res-cost.clearfix span.col-s-11.col-m-12.pl0") %>%
          html_text()
    
       #return a tibble/data.frame
       tibble(url, place, price)
    }
    
    readpage(url)
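
The difference between the two functions can be seen offline on a small example. This is only a sketch: the HTML, the class names and the bar names below are made up for illustration and are not taken from the real site.

```r
library(rvest)

# Toy page (invented for illustration): three results, the second has no price.
html <- '
<html><body>
  <article class="search-result"><a class="result-title" title="Bar A">Bar A</a><span class="price">$20</span></article>
  <article class="search-result"><a class="result-title" title="Bar B">Bar B</a></article>
  <article class="search-result"><a class="result-title" title="Bar C">Bar C</a><span class="price">$35</span></article>
</body></html>'
page <- read_html(html)

# Page-wide selection silently skips the missing price, so the lengths differ.
place_all <- page %>% html_nodes("a.result-title") %>% html_attr("title")
price_all <- page %>% html_nodes("span.price") %>% html_text()
length(place_all)  # 3
length(price_all)  # 2

# Per-parent selection keeps one slot per result and fills the gap with NA.
results <- page %>% html_nodes("article.search-result")
place <- results %>% html_node("a.result-title") %>% html_attr("title")
price <- results %>% html_node("span.price") %>% html_text()
price  # "$20" NA "$35"
```

Because place and price now both have one entry per parent node, they can be combined into a tibble without the size error.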
    

    Also note that in your code example above, you were reading the same page multiple times. This is slow and puts additional load on the server. At high enough volume, this could be viewed as a "denial of service" attack.
    It is best to read the page once into memory and then work with that copy.

    Update
    To answer your question about multiple pages: wrap the above function in an lapply() call and then bind the list of returned data frames (or tibbles) together with bind_rows().

    dfs <- lapply(listofurls, function(url){ readpage(url)})
    finalanswer <- bind_rows(dfs)
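
The snippet above does not show where listofurls comes from. One possible way to build it and to space out the requests is sketched below. It assumes the result pages differ only in the page query parameter, and read_pages() and fake_reader() are hypothetical helpers introduced here so the sketch runs without network access.

```r
library(dplyr)
library(tibble)

# Assumption: pages 1 to 5 follow the same URL pattern as the page=5 example.
base <- "https://www.zomato.com/tr/toronto/drinks-and-nightlife?page="
listofurls <- paste0(base, 1:5)

# Hypothetical helper: call a reader on each URL, pausing before each request
# to limit the load on the server, then stack the results into one tibble.
read_pages <- function(urls, reader, pause = 1) {
  dfs <- lapply(urls, function(u) {
    Sys.sleep(pause)
    reader(u)
  })
  bind_rows(dfs)
}

# Offline stand-in for readpage(); it returns two made-up rows per URL.
fake_reader <- function(url) {
  tibble(url = url, place = c("Bar A", "Bar B"), price = c("$20", NA))
}

demo <- read_pages(listofurls, fake_reader, pause = 0)
nrow(demo)  # 10

# With the real function from the answer:
# finalanswer <- read_pages(listofurls, readpage)
```

Passing the reader as an argument keeps the looping and pausing logic separate from the parsing, so the same helper can be tried offline first and then pointed at readpage().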
    
